of Network De-anonymization 
Abstract—Recently, graph matching algorithms have been 
successfully applied to the problem of network de-anonymization, 
in which nodes (users) participating in more than one social 
network are identified only by means of the structure of their 
links to other members. This procedure exploits an initial set of 
seed nodes large enough to trigger a percolation process which 
correctly matches almost all other nodes across the different 
social networks. Our main contribution is to show the crucial role 
played by clustering, which is a ubiquitous feature of realistic 
social network graphs (and many other systems). Clustering has 
both the effect of making matching algorithms more vulnerable 
to errors, and the potential to dramatically reduce the number 
of seeds needed to trigger percolation, thanks to a wave-like 
propagation effect. We demonstrate these facts by considering a 
fairly general class of random geometric graphs with variable 
clustering level, and showing how clever algorithms can achieve 
surprisingly good performance while containing matching errors. 


I. Introduction 

The advent of online social networks, and their massive 
worldwide penetration, can be well considered as one of 
the most influential changes brought by information and 
communication technologies into our lives during the last 
decade, with profound impact on all aspects of economy, 
society and culture. The extraordinary capitalization of the 
companies running these (typically free) online services can 
be explained by the huge amount of valuable information that 
can be extracted from the traces of activities performed by 
billions of users. Such information makes it possible, for example, to 
build user profiles that can be effectively used for targeted 
advertisements, marketing and social surveys, and many other 
profitable businesses run by service providers and third parties. 
Privacy concerns raised by the collection, analysis and distribution 
of personal data, exposed more or less consciously by 
active users, have recently been hotly debated in the media. 
User privacy is especially threatened when data collected from 
different systems is combined together to construct richer and 
more accurate user profiles. 

In this work we are specifically concerned with the problem 
of identifying users participating in different online social 
networks.^1 We emphasize that this problem can be perceived 
by people in totally different ways. Some users would prefer to 
hide any Personally Identifiable Information (PII) while using a 
service, and they see any attempt to correlate accounts created 
in different systems as a severe violation of their privacy. Other 
users instead are more than happy to merge or link together 
their various accounts, as this turns out to be convenient for 
the users themselves. For example, ‘social logins’ allow users to 
use existing accounts on social networks to directly sign into 
other services (different applications, websites, public Wi-Fi 
hotspots). 

1 More generally, we are interested in any sort of communication system 
assigning some kind of (unique) ID to users, typically as a result of a new 
registration/account creation (including traditional communication services 
such as email and cellular networks). 

In our work, we are specifically interested in privacy issues, 
and consider the case of an ‘attacker’ trying to identify users 
belonging to two different social networks (without their 
consent). Recently, security experts have made the dramatic 
discovery that user privacy cannot be guaranteed when traces 
of communication activities are made available after applying 
the simple anonymization procedure which replaces real IDs 
by random labels [1]. 

A standard way to formalize the user identification problem 
is the following: each communication system (e.g., a given 
social network) generates (from the traces of user activities) 
a ‘contact graph’ in which nodes represent anonymized users, 
and edges denote who has come in contact with whom. The 
attacker then runs a graph matching algorithm on the contact 
graphs generated by different systems, which in the hardest 
case can make use only of the topologies of these graphs, 
without any additional side information [2]. The majority of 
algorithms proposed so far to achieve this goal are facilitated 
by an initial set of already matched nodes (called seeds). This 
is actually a realistic case, since, as explained above, some 
users explicitly link their accounts in different systems ‘for 
free’. Many proposed matching strategies, based on heuristic 
algorithms, work by progressively expanding the set of already 
matched nodes, trying to identify all of the other nodes [1], 
[3], [4]. In particular, in their seminal paper Narayanan and 
Shmatikov [1] were able to identify a large fraction of users 
having accounts on both Twitter and Flickr (with only a 12% 
error rate). 

Significant progress has also been made towards theoretical 
understanding of the feasibility of network de-anonymization 
(in the first place), and of the asymptotic performance of 
graph matching algorithms applied to large systems. Recent 
analytical work has adopted the following convenient probabilistic 
generation model for two contact graphs $\mathcal{G}_1$ and 
$\mathcal{G}_2$: we consider the (inaccessible) ‘ground-truth’ graph $\mathcal{G}_T$ 
representing true social relationships among people, and then 
assume that $\mathcal{G}_1$ is obtained by independently sampling each 
edge of $\mathcal{G}_T$ with probability $s$ (similarly, and independently, 
$\mathcal{G}_2$). Specifically, when the social network $\mathcal{G}_T$ is modeled as 
an Erdős-Rényi random graph, it has been shown in [5] that, 
under mild conditions, users participating in two different 
social networks can be successfully matched by an attacker 
with unlimited computation power, even without seeds. In 
the case of Erdős-Rényi random graphs, the authors of [6] have 
also proposed a practical identification algorithm based on 
bootstrap percolation [7], and they have shown an interesting 
phase transition phenomenon in the number of seeds that are 
required for network de-anonymization. The results in [6] have 
recently been extended to the more realistic case in which 
contact graphs are scale-free (power-law) random graphs. In 
particular, by modeling them as Chung-Lu graphs, [8] and [9] 
have independently shown that a much smaller set of seeds 
is sufficient to trigger the percolation-based matching process 
originally studied in Erdős-Rényi graphs. 

While previous work has captured the impact of power-law 
degree distribution on percolation graph matching, another 
essential feature of real social networks, namely, clustering, 
has not been investigated so far. Interestingly, the authors of [6] 
attempted to apply their basic algorithm also to highly clustered 
random geometric graphs, observing almost total failure 
(error rates above 50%). This preliminary finding has been 
the starting point of our work. In this paper we consider a 
fairly general model of random geometric graphs that allows 
us to incorporate various levels of clustering in the contact graph, 
without concurrently generating a scale-free structure. By so 
doing, we separate the (unknown) impact of clustering from 
the (known) impact of power-law degree, going back to the 
original case of Erdős-Rényi graphs and exploring a totally 
different, ‘orthogonal’ direction. Our main findings are as 
follows: 

(i) Clustered networks can indeed be largely prone to matching 
errors when we naively apply the method proposed in [6]. 
Such errors can be mitigated and asymptotically eliminated 
by an improved matching algorithm still based on bootstrap 
percolation; 

(ii) Once errors are eliminated, clustering turns out to have 
a surprising beneficial effect on the performance of graph 
matching, thanks to a wave-like propagation phenomenon that 
allows all nodes to be progressively identified starting from a very 
small, compact set of seeds; 

(iii) In contrast with previous results derived for Erdős-Rényi 
and Chung-Lu graphs in [6], [8], we show that the minimum 
number of seeds required for network de-anonymization 
increases as the average node degree of the graph grows. 

Our results are qualitatively validated via experiments with 
real social network graphs. We emphasize that, although we 
focus on network de-anonymization, we do not cast our results 
exclusively to this problem. Indeed, the results we derive 
have much broader applicability since graph matching is a 
general problem arising in many different domains, ranging 
from computer graphics to bioinformatics. 

II. Notation and preliminaries 

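Without loss of generality, we assume that $\mathcal{G}_T(\mathcal{V}, \mathcal{E})$, 
$\mathcal{G}_1(\mathcal{V}_1, \mathcal{E}_1)$ and $\mathcal{G}_2(\mathcal{V}_2, \mathcal{E}_2)$ have the same set of nodes (or 
vertices) with cardinality $n$, i.e., $\mathcal{V}_1 = \mathcal{V}_2 = \mathcal{V}$.^2 Similarly 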
to previous work a. a, a. a. a we assume that edges 

2 This assumption can be easily removed by considering that only the 
intersection of vertices belonging to $\mathcal{G}_1$ and $\mathcal{G}_2$ has to be de-anonymized. 



Fig. 1. An example of $\mathcal{G}_1$ and $\mathcal{G}_2$ obtained from $\mathcal{G}_T$ by independent edge 
sampling, and of the pairs graph $\mathcal{P}(\mathcal{G}_T)$. Seeds are highlighted in red. In 
$\mathcal{P}(\mathcal{G}_T)$, good pairs are highlighted in white and bad pairs in grey. 

in $\mathcal{G}_1$ and $\mathcal{G}_2$ are obtained by independently sampling each 
edge of $\mathcal{G}_T$ with probability $s$. Specifically, each edge in $\mathcal{G}_T$ is 
assumed to be (independently) sampled twice, the first time to 
determine its presence in $\mathcal{E}_1$, the second time to determine its 
presence in $\mathcal{E}_2$. This model is a reasonable approximation of 
real systems which permits obtaining fundamental analytical 
insights. 
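The sampling model above can be illustrated with a minimal sketch. The function names (`sample_edges`, `make_contact_graphs`), the representation of a graph as a set of node pairs, and the fixed random seed are assumptions introduced here for illustration only.

```python
import random

def sample_edges(edges, s, rng):
    """Keep each edge independently with probability s."""
    # Sorting only makes the result reproducible for a given seed.
    return {e for e in sorted(edges) if rng.random() < s}

def make_contact_graphs(gt_edges, s, seed=0):
    """Sample the ground-truth edge set twice, independently."""
    rng = random.Random(seed)
    e1 = sample_edges(gt_edges, s, rng)  # edges observed in the first system
    e2 = sample_edges(gt_edges, s, rng)  # edges observed in the second system
    return e1, e2

# Hypothetical toy ground-truth graph on five nodes.
gt_edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)}
e1, e2 = make_contact_graphs(gt_edges, s=0.7)
```

Both sampled edge sets are subsets of the ground-truth edge set, and an edge survives in both graphs with probability $s^2$.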

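To match $\mathcal{G}_1$ and $\mathcal{G}_2$, we build the pairs graph $\mathcal{P}(\mathcal{V}, \mathcal{E})$, 
with $\mathcal{V} \subseteq \mathcal{V}_1 \times \mathcal{V}_2$ and $\mathcal{E} \subseteq \mathcal{E}_1 \times \mathcal{E}_2$. In $\mathcal{P}(\mathcal{V}, \mathcal{E})$ there exists 
an edge between $[i_1, j_2]$ and $[k_1, l_2]$ iff edge $(i_1, k_1) \in \mathcal{E}_1$ 
and edge $(j_2, l_2) \in \mathcal{E}_2$. We will slightly abuse the notation 
and denote the pairs graph associated with a generic ground-truth 
graph $\mathcal{G}_T$ simply as $\mathcal{P}(\mathcal{G}_T)$. Fig. 1 shows the pairs graph built 
from a toy example. 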

We will refer to pairs $[i_1, i_2] \in \mathcal{P}(\mathcal{G}_T)$, whose vertices 
correspond to the same vertex $i \in \mathcal{G}_T$, as good pairs, and 
to all others (e.g., $[i_1, j_2]$) as bad pairs. Also, we will refer to 
two pairs such as $[i_1, j_2]$ and $[i_1, k_2]$, or $[i_1, j_2]$ and $[k_1, j_2]$, 
as conflicting. Finally, two adjacent pairs on $\mathcal{P}(\mathcal{G}_T)$ will be 
referred to as neighbors. The seed set^3 will be denoted by 
$\mathcal{A}_0(n) \subset \mathcal{V}$, with cardinality $a_0$. 

We now briefly describe the Percolation Graph Matching 
(PGM) algorithm originally proposed in [6]. The PGM algorithm 
maintains an integer counter (initialized to zero) for any 
pair of $\mathcal{P}(\mathcal{G}_T)$ that may still be matched. It exploits a set $\mathcal{A}_t$, 
indexed by time step $t$, which is initialized (for $t = 0$) with 
the seed pairs. At any given time $t \ge 0$, the PGM algorithm 
extracts at random one pair from $\mathcal{A}_t$, matches it, and increases 
by one the counter associated with each of its neighbor pairs 
in $\mathcal{P}(\mathcal{G}_T)$. Then the algorithm adds to $\mathcal{A}_{t+1}$ all pairs whose 
counter has reached the threshold $r$ at time $t$, with the exception of those 
pairs that are in conflict with either any of the already matched 
pairs or any of the pairs in $\mathcal{A}_t$. The algorithm stops when 
$\mathcal{A}_t = \emptyset$. It is straightforward to see that PGM takes at most $n$ 
steps to terminate. 
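The description above can be turned into a short sketch. This is an illustrative reading of PGM, not the reference implementation of the original authors: the function name `pgm`, the adjacency-dictionary representation, and the tie-breaking rule (when several conflicting pairs reach the threshold in the same step, the first one encountered in sorted order is kept) are assumptions made here.

```python
import random
from collections import defaultdict

def pgm(adj1, adj2, seeds, r, rng=None):
    """Percolation graph matching on the (implicit) pairs graph.

    adj1, adj2: dict node -> set of neighbours in the two contact graphs.
    seeds: list of already matched pairs (i, j).
    Returns a dict mapping matched nodes of graph 1 to nodes of graph 2.
    """
    rng = rng or random.Random(0)
    counters = defaultdict(int)      # marks received by each candidate pair
    active = list(seeds)             # the set A_t
    used1 = {i for i, _ in seeds}    # graph-1 nodes in matched pairs or in A_t
    used2 = {j for _, j in seeds}    # graph-2 nodes in matched pairs or in A_t
    matched = {}
    while active:
        # Extract one pair uniformly at random from A_t and match it.
        idx = rng.randrange(len(active))
        active[idx], active[-1] = active[-1], active[idx]
        i, j = active.pop()
        matched[i] = j
        # Add one mark to every non-conflicting neighbour pair (k, l).
        for k in sorted(adj1.get(i, ())):
            if k in used1:
                continue
            for l in sorted(adj2.get(j, ())):
                if l in used2:
                    continue
                counters[(k, l)] += 1
                if counters[(k, l)] >= r:
                    active.append((k, l))
                    used1.add(k)
                    used2.add(l)
                    break  # k is now taken; other pairs (k, .) would conflict
    return matched

# Toy example with identical contact graphs (s = 1):
# node v is linked to v + 1 and v + 2.
n = 8
adj = {v: set() for v in range(n)}
for v in range(n):
    for w in (v + 1, v + 2):
        if w < n:
            adj[v].add(w)
            adj[w].add(v)
result = pgm(adj, adj, seeds=[(0, 0), (1, 1)], r=2)
```

The pairs graph is never materialized: the neighbors of a pair $[i_1, j_2]$ are enumerated on the fly from the neighborhoods of $i_1$ and $j_2$, which keeps memory proportional to the number of pairs that actually receive marks.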

In the case where $\mathcal{G}_T$ is an Erdős-Rényi random graph, 
previous work [6] has established the following lower bound 
on the number of seeds that are needed to correctly match 
almost all nodes without errors. 

3 We will refer to the seed set as a subset of vertices, or, equivalently, of 
good vertex pairs, that have been identified a priori. 


Critical seed set size for Erdős-Rényi graphs [6]. Let $\mathcal{G}_T$ 
be an Erdős-Rényi random graph $G(m, p)$. Let $r \ge 4$. Denote 
by $a_c$ the critical seed set size: 

$$a_c = \left(1 - \frac{1}{r}\right) \left( \frac{(r-1)!}{m (p s^2)^r} \right)^{\frac{1}{r-1}}. \qquad (1)$$

For $m^{-1} \ll p s^2 \le s^2 m^{-3.5/r}$, we have that, if $a_0/a_c \to a > 1$, 
the PGM algorithm matches w.h.p. a number of good pairs 
equal to $m - o(m)$ (i.e., all vertex pairs except for a negligible 
fraction) with no errors. 
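To get a feeling for how the critical seed set size scales, the following sketch evaluates it numerically. It assumes that the critical size has the form $a_c = (1 - 1/r)\,\big((r-1)!/(m (p s^2)^r)\big)^{1/(r-1)}$; the function name `critical_seed_size` and the parameter values are illustrative choices, not taken from the original text.

```python
from math import factorial

def critical_seed_size(m, p, s, r):
    """Evaluate a_c = (1 - 1/r) * ((r-1)! / (m (p s^2)^r))^(1/(r-1))."""
    base = factorial(r - 1) / (m * (p * s * s) ** r)
    return (1.0 - 1.0 / r) * base ** (1.0 / (r - 1))

# Hypothetical parameters: denser graphs (larger p) need fewer seeds.
sparse = critical_seed_size(m=10_000, p=0.01, s=0.8, r=4)
dense = critical_seed_size(m=10_000, p=0.02, s=0.8, r=4)
```

Under this form, $a_c$ decreases when either the edge probability $p$ or the sampling probability $s$ grows, which is the behavior that the clustered model later contrasts with.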


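Critical seed set size for random graphs bounded by 
Erdős-Rényi graphs. Let $\mathcal{H}(\mathcal{V}, \mathcal{E}_H)$ and $\mathcal{K}(\mathcal{V}, \mathcal{E}_K)$ be two 
random graphs defined on the same set of vertices $\mathcal{V}$, where 
$\mathcal{E}_H \subseteq \mathcal{E}_K$, i.e., $\mathcal{E}_H$ can be obtained by sampling $\mathcal{E}_K$. We 
define the following partial order relationship: $\mathcal{H}(\mathcal{V}, \mathcal{E}_H) \le_{st} 
\mathcal{K}(\mathcal{V}, \mathcal{E}_K)$. Given that, below we extend our result in [8]. 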

Theorem 1: Consider $\mathcal{G}_T$ satisfying $G(m, p_{\min}) \le_{st} 
\mathcal{G}_T \le_{st} G(m, p_{\max})$ with $p_{\min} < p_{\max}$. Applying the PGM 
algorithm to $\mathcal{P}(\mathcal{G}_T)$ guarantees that $m - o(m)$ good pairs are 
matched with no errors w.h.p., provided that: 

1. $m \to \infty$; 

2. $p_{\min} = \Theta(p_{\max})$ and $p_{\min} \gg m^{-1}$; 

3. $p_{\max} \ll m^{-3.5/r}$; 

4. $\liminf_{m \to \infty} a_0/a_c > 1$, with $a_c$ computed from (1) by setting 
$p = p_{\min}$. 

Also, under conditions 1)-4), the PGM algorithm successfully matches 
w.h.p. all the correct pairs (with no errors) also in any subgraph 
$\mathcal{G}'_T$ of $\mathcal{G}_T$ that comprises a finite fraction of vertices of $\mathcal{G}_T$ and 
all the edges between the selected vertices. The proof can be 
found in Appendix [A]. 

Corollary 1: Under the same conditions as in Theorem 1, 
the PGM algorithm can be successfully applied to an imperfect 
pairs graph $\hat{\mathcal{P}}(\mathcal{G}_T) \subset \mathcal{P}(\mathcal{G}_T)$ comprising a finite fraction of the 
pairs in $\mathcal{P}(\mathcal{G}_T)$ and satisfying the following constraint: a bad 
pair $[i_1, j_2] \in \mathcal{P}(\mathcal{G}_T)$ is included in $\hat{\mathcal{P}}(\mathcal{G}_T)$ only if either $[i_1, i_2]$ 
or $[j_1, j_2]$ are also in $\hat{\mathcal{P}}(\mathcal{G}_T)$. 

Under the above conditions, the objective of this work is 
to design and analyze the network de-anonymization process 
when the ground-truth graph, $\mathcal{G}_T$, exhibits different levels of 
node clustering. In particular, given $\mathcal{G}_T$, $\mathcal{G}_1$ and $\mathcal{G}_2$, we aim to 
determine the minimum size of the seed set that is required to 
successfully identify w.h.p. all good vertex pairs in $\mathcal{P}(\mathcal{G}_T)$ with 
no errors. To this end, given the large size of social network 
graphs, we perform an asymptotic analysis, i.e., we consider 
the number of vertices in $\mathcal{G}_T$ to grow very large ($n \to \infty$). 


III. Clustered network model 

As detailed below, we model the social graph Gt as a 
geometric random graph. At the end of this section, we 
highlight how our model well captures node clustering and 
how it can represent network graphs with different values of 
clustering coefficient. 


We assume that nodes are located in a $k$-dimensional space 
corresponding to the hyper-cube^4 $\mathcal{H} = [0, 1]^k \subset \mathbb{R}^k$, where 
the $k$ dimensions correspond to different attributes of the 
user nodes. We consider the nodes to be independently and 
uniformly distributed over $\mathcal{H}$. Given any two vertices $i, j \in \mathcal{V}$, 
with $i \neq j$, edge $(i, j)$ exists in the graph with probability 
$p_{ij}$ that depends on the Euclidean distance $d_{ij}$ between the 
respective positions of the two vertices in $\mathcal{H}$. We consider the 
following generic law for $p_{ij}$: 

$$p_{ij} = K(n) f(d_{ij}). \qquad (2)$$

In (2), $f$ is a non-increasing function of the distance, and 
$K(n)$ is a normalization constant introduced to impose a 
desired average node degree, $D(n)$, which is assumed to be 
the same for all nodes. It is customary in random graph models 
representing realistic systems to assume that the average node 
degree is not constant, but increases with $n$ due to network 
densification. Also, although a common choice is to assume 
$D(n) = \Theta(\log n)$, in our model we consider $D(n) = \Omega(\log n)$ 
so as to encompass almost all systems of practical interest. 

Since we are interested in the asymptotic performance of 
graph de-anonymization as $n$ grows large, it is convenient to 
further characterize the shape of function $f$ as follows. Let 
us define $C(n)$ to be at least equal to the minimal (in order 
sense) distance between nodes in $\mathcal{H}$, i.e., $n^{-1/k}$. We assume 
that $f(d)$ is equal to 1 for all distances $0 \le d \le C(n)$. This 
implies that $K(n)$ must be less than or equal to 1 to obtain a 
proper probability function. For distances larger than $C(n)$, we 
assume that $f$ decays according to a power law with exponent 
$\beta$, with $\beta > 0$. In summary, 

$$f(d_{ij}) = \min\left\{1, \left(\frac{C(n)}{d_{ij}}\right)^{\beta}\right\}. \qquad (3)$$

The above characterization of the shape of $f(d)$ is fairly 
general and allows accounting for different levels of node 
clustering. In particular, our random-graph model degenerates 
into a standard Erdős-Rényi graph when $C(n) = \Theta(1)$, 
with arbitrary $\beta$. For $\beta \to \infty$, instead, we have a geometric 
graph, i.e., edges can be established only between nodes whose 
distance is smaller than or equal to $C(n)$. 
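A small generator makes the model concrete. The sketch below assumes that $f$ equals 1 up to distance $C$ and decays as $(C/d)^{\beta}$ beyond it, and that distances are measured on a torus to avoid border effects; the function names and the quadratic pairwise loop are illustrative choices suited only to small $n$.

```python
import math
import random

def torus_distance(x, y):
    """Euclidean distance on the unit torus (wrap-around in each dimension)."""
    return math.sqrt(sum(min(abs(a - b), 1.0 - abs(a - b)) ** 2
                         for a, b in zip(x, y)))

def edge_prob(d, K, C, beta):
    """p = K * f(d), with f(d) = 1 for d <= C and (C / d)^beta otherwise."""
    f = 1.0 if d <= C else (C / d) ** beta
    return K * f

def generate_graph(n, k, K, C, beta, seed=0):
    """Place n nodes uniformly in [0, 1]^k and draw each edge independently."""
    rng = random.Random(seed)
    pos = [tuple(rng.random() for _ in range(k)) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            d = torus_distance(pos[i], pos[j])
            if rng.random() < edge_prob(d, K, C, beta):
                edges.add((i, j))
    return pos, edges

# Hypothetical parameters: 200 nodes in the unit square, clusters of side ~0.1.
pos, edges = generate_graph(n=200, k=2, K=0.5, C=0.1, beta=4.0)
```

Raising `beta` concentrates edges within distance `C`, approaching a pure geometric graph, while a value of `C` comparable to the side of the hyper-cube makes all pairs equally likely to be connected, as in an Erdős-Rényi graph.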

The average node degree is: 

$$D(n) = \Theta\left( n K(n) \left( C^k(n) + C^{\beta}(n) \int_{C(n)}^{1} \rho^{k-1-\beta} \, d\rho \right) \right). \qquad (4)$$

Now, from (4) it follows that for $\beta > k$ the dominant 
component of the neighbors of a given node lies at a distance 
$\Theta(C(n))$ from it, while for $\beta < k$ only a marginal fraction of 
the neighbors of a node lies at distance $o(1)$ from it. Since we 
are interested in graphs with significant node clustering (so as 
to mimic real-world social networks), we restrict our analysis 
to the case $\beta > k$. In this case, the average node degree is 
given by: 

$$D(n) = \Theta(n K(n) C^k(n)). \qquad (5)$$

4 To avoid border effects, we assume wrap-around conditions (i.e., a torus 
topology). 

Because by construction $K(n) \le 1$, the average node degree is 
constrained to be $O(nC^k(n))$. Moreover, given that we assume 
$D(n) = \Omega(\log n)$, we have $C(n) = \Omega\left( \left( \frac{\log n}{n} \right)^{1/k} \right)$. 
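The order of the average degree for $\beta > k$ can be checked with a short calculation. This is a sketch under the assumption that the contribution of nodes farther than $C(n)$ is the integral of $\rho^{k-1} f(\rho)$, with $f(\rho) = (C(n)/\rho)^{\beta}$, up to a constant distance (taken as 1 here).

```latex
\begin{align*}
C^{\beta}(n) \int_{C(n)}^{1} \rho^{k-1-\beta}\, d\rho
  &= C^{\beta}(n)\, \frac{C^{k-\beta}(n) - 1}{\beta - k} \\
  &= \frac{C^{k}(n) - C^{\beta}(n)}{\beta - k}
   = \Theta\big(C^{k}(n)\big)
   \quad \text{for } \beta > k,\ C(n) = o(1).
\end{align*}
```

Hence the far-away neighbors contribute at most the same order as the neighbors within distance $C(n)$, and the average degree is of order $nK(n)C^k(n)$.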

The clustering coefficient turns out to be $\Theta(K(n))$, as a 
direct consequence of the fact that the major part of the 
neighbors of a node lies at a distance $\Theta(C(n))$ from it. In 
the following, we will slightly abuse the language and refer 
to groups of vertices that lie in sub-regions of side $\Theta(C(n))$ 
as clusters. Furthermore, we observe that, given the above 
expressions, the ratio of the clustering coefficient ($\Theta(K(n))$) 
to the graph density^5 ($\Theta(D(n)/n)$) is $\Theta(1/C^k(n))$. This 
implies that our graph exhibits a high level of clustering. 
Indeed, since in general $C^k(n) = o(1)$, the probability that 
two nodes are connected, conditioned on the fact that they have 
a common neighbor, is higher (in order sense) than the average 
probability that any two nodes are connected. It follows that 
$K(n)$ and $C(n)$ are the key model parameters through 
which we can directly control the clustering coefficient of the 
graph as well as the graph density. Thus, they play a crucial 
role in the analysis we present below. 

IV. Overview and main results 

In our analysis we address two cases: clusters with relatively 
sparse structure, i.e., $K(n) = o((nC^k(n))^{-\gamma})$ for some $\gamma > 0$, 
and clusters with extremely dense (up to a quasi-clique) 
structure, i.e., $K(n) = \omega((nC^k(n))^{-\gamma})$ for any $\gamma > 0$. 

In the former case the cluster density goes to zero sufficiently 
fast as the number of nodes within the cluster goes 
to infinity ($nC^k(n) \to \infty$). On the contrary, the latter 
corresponds to a cluster density that either is bounded away 
from zero or goes to zero very slowly, with $K(n) = \Theta(1)$ 
being a particularly relevant sub-case. 

We observe that, in the case of relatively sparse cluster 
structure, the density of edges between nodes within a cluster 
is not excessively large and, thus, PGM can be safely applied 
without the risk of incurring significant matching errors. We 
therefore apply the following procedure to determine the minimum 
seed set size required for successful graph de-anonymization. 
We assume that the set of seeds lies in a small sub-region 
of $\mathcal{H}$ of size $\Theta(C(n))$ (i.e., within a cluster). Then, through 
the PGM algorithm, we de-anonymize all nodes that lie 
sufficiently close to the seeds (within a prefixed distance). 
Once a significant bulk of pairs has been matched in this 
sub-region, the de-anonymization procedure is performed by 
successfully matching, at every stage, pairs that are sufficiently 
close to the previously matched pairs. Note that, from 
the second stage on, we no longer apply PGM but 
a simpler proximity-based strategy, matching those pairs that 
have a sufficiently large number of neighbors among the pairs 
matched at earlier stages. The way the matching procedure 
evolves is exemplified in Fig. 2. 

In the case of dense cluster structure, the whole procedure 
is slightly more complex in light of the fact that the clustering 

5 Given a generic graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, the graph density is defined as $\frac{2|\mathcal{E}|}{|\mathcal{V}|(|\mathcal{V}| - 1)}$. 
It can be interpreted as the probability that an edge exists between two 
randomly selected nodes of the graph. 


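TABLE I 
Main results 

Scenario | Minimum seed size 
$K(n) = \omega((nC^k(n))^{-\gamma})$, $\forall \gamma > 0$ | $O((nC^k(n))^{\epsilon})$, $\forall \epsilon > 0$ 
$K(n) = o((nC^k(n))^{-\gamma})$, with $\gamma > 0$ | $O\left( \frac{\log(nC^k(n))}{K(n)} \right)$ 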

coefficient is larger; thus, considering short edges while running 
the PGM algorithm would lead to matching a large number 
of bad pairs (as their counters will likely exceed the threshold 
$r$). It follows that we have to ignore all edges whose length is 
too short (shorter than a properly defined threshold $\omega(C(n))$), 
in order to guarantee that almost no errors are made. More 
specifically, first we consider two groups of nodes that reside 
in two sub-regions of $\mathcal{H}$ of side $h(n) = \Theta(C(n))$, which are 
taken sufficiently far apart from each other (see Fig. 3). Again, 
we assume that a suitable number of seeds is included 
in each sub-region. To de-anonymize all nodes in the sub-regions, 
we modify the PGM algorithm so that only the edges 
between the two different sub-regions are exploited. Then, 
by leveraging the presence of dense clusters, we show that, 
given two nodes in $\mathcal{H}$, their mutual distance can be estimated 
quite precisely. Thus, given a sub-region where nodes have 
already been matched, we can select a set of nodes that are 
again sufficiently far apart from the others and repeat the above 
procedure. The procedure can be iterated until almost all good 
pairs are successfully matched. 



Fig. 2. Graphical representation of the de-anonymization procedure for 
$K(n) = o((nC^k(n))^{-\gamma})$. 

Fig. 3. Graphical representation of bipartite graph construction for 
$K(n) = \omega((nC^k(n))^{-\gamma})$. The two sub-regions have side $h(n)$. 

In Table I we summarize our results on the minimum 
size of the seed set that is required for successful network 
de-anonymization, when seeds are taken from compact sub-regions 
in $\mathcal{H}$. Observe that the minimum number of seeds 
depends on both $K(n)$ and $C(n)$, while it is independent of $\beta$. 
Specifically, in the regime of dense cluster structure (first row 
of the table), the minimum number of seeds can be simply 
expressed in terms of the average number of nodes falling 
within a cluster ($nC^k(n)$). Indeed, a seed set whose size 
is equal to $(nC^k(n))^{\epsilon}$, for some $\epsilon$, is enough to guarantee 
an almost complete successful network de-anonymization. In 
the relevant case in which $C(n) = \Theta\left( (\log n / n)^{1/k} \right)$ (i.e., when 
the average degree of the graph is $D(n) = \Theta(\log n)$), the 
above expression degenerates into $(\log n)^{\epsilon}$. This last expression 
permits grasping immediately the potential impact of 
node clustering on de-anonymization techniques. Furthermore, 
somewhat surprisingly, the minimum seed set size increases 
when we increase the average degree of the graph nodes 
by increasing $C(n)$. We remark that this is in sharp contrast 
with previous results derived for Erdős-Rényi and Chung-Lu 
graphs in [6], [8]. The intuition behind this result is that, by 
increasing $C(n)$, we also increase the cluster size, making 
the problem of identifying nodes (users) within a cluster 
intrinsically more challenging. Finally, when clusters become 
sparser (second row of the table), de-anonymization techniques 
become less effective, and the minimum seed set size is 
inversely proportional to $K(n)$. 


V. Sparse clusters 


In this case, we assume $K(n) = o((nC^k(n))^{-\gamma})$, for some 
$\gamma > 0$, and a set of seeds $\mathcal{A}_0$ ($|\mathcal{A}_0| = a_0$) whose maximum 
mutual distance is $d_s = O(C(n))$. 

As the first step, we show how nodes in $\mathcal{H}$ that lie 
sufficiently close to seeds can be identified. To this end, we 
start by defining two sub-regions, $\mathcal{H}_{in} \subset \mathcal{H}$ and $\mathcal{H}_{out} \subset \mathcal{H}$. 
Intuitively, $\mathcal{H}_{in}$ ($\mathcal{H}_{out}$) can be seen as a set of points whose 
distance from any seed vertex is lower (higher) than a given 
threshold. More formally, denote by $x$ a generic point in $\mathcal{H}$ and 
by $x_\sigma$ the position in $\mathcal{H}$ of a generic seed vertex $\sigma$. Then, given 
two positive constants $\alpha$ and $\delta$, s.t. $\delta < 1$ and $\alpha(1 + \delta) < 1$, 
we have: 

$$\mathcal{H}_{in}(\alpha, \delta) = \left\{ x \ \text{s.t.} \ \max_{\sigma \in \mathcal{A}_0} \|x - x_\sigma\| < f^{-1}((1 + \delta)\alpha) \right\}$$
$$\mathcal{H}_{out}(\alpha, \delta) = \left\{ x \ \text{s.t.} \ \min_{\sigma \in \mathcal{A}_0} \|x - x_\sigma\| > f^{-1}((1 - \delta)\alpha) \right\}$$

where $f$ is the non-increasing function defined in Section III. 
The two sub-regions are depicted in Fig. 4. Recall that, by 
construction, $|\mathcal{H}_{in}| = \Theta(C^k(n))$ since $f(d)$ vanishes for 
$d \gg C(n)$. 

The theorem below proves that, given graph $\mathcal{G}_1$ ($\mathcal{G}_2$), it is 
possible to correctly distinguish nodes in $\mathcal{H}_{in}(\alpha, \delta)$ from nodes 
in $\mathcal{H}_{out}(\alpha, \delta)$ by counting the number of their neighbor seeds. 

Theorem 2: Given a node $i \in \mathcal{G}_1$ ($i \in \mathcal{G}_2$), let $S_i$ be the 
number of seeds that are neighbors of $i$ on $\mathcal{G}_1$ ($\mathcal{G}_2$). We say 
that node $i$ is accepted if $S_i > \alpha s K(n) a_0$. If $d_s = O(C(n))$ 
and $a_0 = \Omega\left( \frac{\log(nC^k(n))}{K(n)} \right)$, then for an arbitrary $\delta > 0$, 
the above procedure correctly accepts all nodes located in 
$\mathcal{H}_{in}(\alpha, \delta)$ while it excludes all nodes located in $\mathcal{H}_{out}(\alpha, \delta)$. 

Fig. 4. Graphical representation of $\mathcal{H}_{in}(\alpha, \delta)$ and $\mathcal{H}_{out}(\alpha, \delta)$. 

Proof: See Appendix [A] ■ 

Note that, in the above statement, $sK(n)$ is the probability 
that a node in $\mathcal{G}_1$ ($\mathcal{G}_2$)^6 is connected with a seed node if their 
distance is $C(n)$ or less. Thus, $\alpha s K(n) a_0$ is a threshold on 
the number of connections between a node and the $a_0$ seed 
vertices. 
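The acceptance rule of Theorem 2 amounts to a simple count. The sketch below is illustrative: the function name `accept_by_seed_count`, the adjacency-dictionary representation, and the toy numbers are assumptions, and the rule is applied to one contact graph at a time.

```python
def accept_by_seed_count(adj, seeds, alpha, s, K):
    """Return the nodes whose number of seed neighbours exceeds alpha*s*K*a0."""
    seed_set = set(seeds)
    threshold = alpha * s * K * len(seed_set)
    return {v for v, nbrs in adj.items() if len(nbrs & seed_set) > threshold}

# Toy contact graph: seeds are 0..3; node 4 sees three seeds, node 5 two, node 6 none.
adj = {
    0: {4, 5}, 1: {4, 5}, 2: {4}, 3: set(),
    4: {0, 1, 2}, 5: {0, 1}, 6: set(),
}
accepted = accept_by_seed_count(adj, seeds=[0, 1, 2, 3], alpha=0.5, s=1.0, K=1.0)
```

With these hypothetical values the threshold equals 2, so only node 4 is accepted; node 5 sits exactly on the threshold and is rejected because the inequality is strict.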

Next, we denote by $\mathcal{N}_1(\alpha)$ and $\mathcal{N}_2(\alpha)$, respectively, the 
set of nodes from $\mathcal{G}_1$ and $\mathcal{G}_2$ that are classified as located in 
$\mathcal{H}_{in}(\alpha, \delta)$. By construction, we have $|\mathcal{N}_1(\alpha)| = \Theta(nC^k(n))$ 
and $|\mathcal{N}_2(\alpha)| = \Theta(nC^k(n))$. We build the pairs graph $\mathcal{P}(\mathcal{N})$ 
that is induced by the nodes of $\mathcal{G}_1$ and $\mathcal{G}_2$ that belong to, 
respectively, $\mathcal{N}_1(\alpha)$ and $\mathcal{N}_2(\alpha)$. While doing this, we make 
sure that a bad pair $[i_1, j_2]$ is included in $\mathcal{P}(\mathcal{N})$ only if 
either $[i_1, i_2]$ or $[j_1, j_2]$ are also included in $\mathcal{P}(\mathcal{N})$. This is 
accomplished as follows. We apply the previous classification 
procedure twice, using two different values $\alpha_1$ and $\alpha_2$, with 
$\alpha_1 > \alpha_2$, chosen in such a way that $\mathcal{H}_{out}(\alpha_1, \delta) \subset \mathcal{H}_{in}(\alpha_2, \delta)$. 
Then we insert in $\mathcal{P}(\mathcal{N})$ all pairs whose constituent nodes have 
been selected by at least one of the classification procedures, 
adding the constraint that at least one of the nodes must have 
been selected by both. Since, by construction, no good pair 
$[i_1, i_2]$ exists s.t. $i_1$ falls in $\mathcal{H}_{in}(\alpha_1, \delta)$ and $i_2$ in $\mathcal{H}_{out}(\alpha_2, \delta)$ 
(or vice versa), the above condition is ensured. 

We then apply the PGM algorithm on $\mathcal{P}(\mathcal{N})$. Our goal is 
now to verify that the conditions in Theorem 1 hold so that, 
applying the theorem and Corollary 1, we can claim that all 
good pairs in $\mathcal{P}(\mathcal{N})$ can be matched with no error. To this end, 
let us define $m = \Theta(nC^k(n))$, which in order sense equals 
the number of nodes in $\mathcal{N}_1(\alpha)$ and $\mathcal{N}_2(\alpha)$. Then recall that 
$p_{\min} = \Theta(p_{\max})$, $p_{\max} = K(n)$ and $K(n) = o(m^{-\gamma})$. Thus, 
for a sufficiently large $r$, $p_{\max} \ll m^{-3.5/r}$. Furthermore, since 
by assumption $nC^k(n)K(n) = \Omega(\log n)$, it follows that $p_{\min} \gg 
m^{-1}$. Finally, it is easy to see that $a_0/a_c \to \infty$. Indeed, from 
(1), $a_c = O(1/K(n))$ while, by assumption (see Theorem 2), 
$a_0 = \Omega\left( \frac{\log(nC^k(n))}{K(n)} \right)$. In conclusion, we have that all good 
pairs whose nodes fall in $\mathcal{H}_{in}(\alpha_1, \delta)$ can be correctly matched. 

To further expand the set of identified pairs, we can pursue 
the following simple approach. Starting from the bulk of pairs 
already matched, which act as seeds, we consider a larger 
region that includes the previous one. By properly setting 
a threshold $r$, we match all the pairs that have at least $r$ 
neighbors among the seeds. By so doing, we successfully match 
w.h.p. all good pairs in the region with no errors. More 
formally, the following theorem allows us to claim that our 
approach can be successfully employed. 

6 Recall that $\mathcal{G}_1$ ($\mathcal{G}_2$) is a subgraph obtained from $\mathcal{G}_T$ by sampling the edges 
with probability $s$. 

Theorem 3: Consider a circular region centered in 0 and of 
radius $\rho$, $\mathcal{D}(0, \rho)$, with $\rho > C(n)$. Given that all (or almost all) 
nodes lying within $\mathcal{D}(0, \rho)$ have been correctly identified, it is 
possible to correctly identify (almost) all nodes in $\mathcal{D}(0, \rho_1) \setminus 
\mathcal{D}(0, \rho)$ with a probability $1 - o(n^{-1})$ for $\rho_1 = \rho + C(n)/2$ 
when $K(n) = o((nC^k(n))^{-\gamma})$ for some $\gamma > 0$. In addition, 
none of the bad pairs formed by nodes in $\mathcal{H} - \mathcal{D}(0, \rho)$ will 
be identified with a probability $1 - o(n^{-1})$. This is done by 
setting a threshold r = j\V(0,p) fl D(x, C(n))| ^p^-, with 
$|x| = \rho_1$ and identifying as good pairs those in $\mathcal{H} \setminus \mathcal{D}(0, \rho)$ 
that have at least $r$ neighbors among good pairs in $\mathcal{D}(0, \rho)$. 
The proof is based on the application of standard concentration 
results, namely, the Chernoff bound and the inequalities in 
Appendix [A]. The detailed proof is given in Appendix [A]. 
Almost all good pairs can be matched w.h.p. by iterating the 
matching procedure of Theorem 3 a number of $\Theta(1/C(n))$ 
times. Indeed, each time the PGM algorithm successfully 
matches all good pairs whose constituent nodes lie within 
a distance $C(n)/2$ from the bulk of previously matched pairs. 
Note that Theorem 3 also guarantees that, jointly over all 
rounds, no bad pair is matched w.h.p. 
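One round of the proximity-based expansion can be sketched as follows. This is an illustration rather than the exact procedure analyzed in Theorem 3: the function name `proximity_round`, the explicit candidate sets, and the rule of discarding a node when more than one candidate reaches the threshold are assumptions introduced here.

```python
def proximity_round(adj1, adj2, matched, cand1, cand2, r):
    """Match candidate pairs having at least r neighbours among matched pairs.

    matched: dict from graph-1 nodes to graph-2 nodes (pairs already identified).
    cand1, cand2: candidate nodes of the two graphs in the next region.
    """
    new_pairs = {}
    taken = set()
    for k in sorted(cand1):
        hits = []
        for l in sorted(cand2):
            # Count matched pairs (i, matched[i]) adjacent to both k and l.
            marks = sum(1 for i in adj1.get(k, ())
                        if matched.get(i) in adj2.get(l, ()))
            if marks >= r:
                hits.append(l)
        # Keep the match only when it is unambiguous.
        if len(hits) == 1 and hits[0] not in taken:
            new_pairs[k] = hits[0]
            taken.add(hits[0])
    return new_pairs

# Toy example: nodes 0..2 are already matched; node 3 touches all of them,
# node 4 touches only node 0.
adj = {0: {3, 4}, 1: {3}, 2: {3}, 3: {0, 1, 2}, 4: {0}}
matched = {0: 0, 1: 1, 2: 2}
new_pairs = proximity_round(adj, adj, matched, cand1={3, 4}, cand2={3, 4}, r=2)
```

Iterating such rounds, each time adding the new pairs to `matched` and moving the candidate region outward, mimics the wave-like propagation described above.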

VI. Dense clusters 

The case $K(n) = \omega((nC^k(n))^{-\gamma})$, for any $\gamma > 0$, 
is significantly different from the previous one since the 
de-anonymization algorithm must disregard all edges whose 
length is too short (shorter than a properly defined threshold 
$\omega(C(n))$) so as to avoid errors (i.e., matching bad pairs). The 
approach we propose to address this case relies on some results 
that we initially obtain for the special case in which $\mathcal{G}_T$ is a 
bipartite graph. Then we extend such results to our clustered 
social graph and derive the minimum seed set size that is 
required for graph de-anonymization. 

A. Results on bipartite graphs 

Here we restrict our analysis to a ground-truth graph $\mathcal{G}_T$ that 
is an $m_l \times m_r$ bipartite graph. Let $\mathcal{M}_l$ denote the set of vertices 
on the left hand side (LHS), with $|\mathcal{M}_l| = m_l$, and $\mathcal{M}_r$ the set 
of vertices on the right hand side (RHS), with $|\mathcal{M}_r| = m_r$. 
We assume that for any pair of vertices $i \in \mathcal{M}_l$ and $j \in \mathcal{M}_r$ 
an edge $(i, j)$ exists in the graph with probability $p_{ij}$, with 
$p_{\min} \le p_{ij} \le p_{\max}$ and $p_{\max} = \eta\, p_{\min}$ for some finite positive 
$\eta$. The goal here is to identify a minimum number of seeds 
$a_0$, with $a_0 = |\mathcal{A}_0^l|$ in $\mathcal{M}_l$ and $a_0 = |\mathcal{A}_0^r|$ in $\mathcal{M}_r$, such that 
vertices in $\mathcal{M}_l$ and $\mathcal{M}_r$ can be correctly matched. 

Let us first consider the case where $m_l = m_r = m$, for 
which the theorem below holds. 

Theorem 4: Assume that $\mathcal{G}_T$ is an $m \times m$ bipartite graph 
and that two sets of seeds, $\mathcal{A}_0^l$ and $\mathcal{A}_0^r$, of cardinality $a_0$, 
are available on, respectively, the LHS and the RHS of the 
graph. Then the PGM algorithm with threshold $r \ge 4$ correctly 
identifies $m - o(m)$ good pairs w.h.p. on the RHS and the LHS 
of graph $\mathcal{P}(\mathcal{G}_T)$, with no errors, if: 

1. $m^{-1} \ll p_{\min} \le p_{\max} \ll m^{-3.5/r}$; 

2. $\liminf_{m \to \infty} a_0/a_c > 1$, 

where 

$$a_c = \left(1 - \frac{1}{r}\right) \left( \frac{(r-1)!}{m (p_{\min} s^2)^r} \right)^{\frac{1}{r-1}}.$$

Proof: See Appendix [A]. ■ 

Theorem 4 can be extended to the more general case where 
$m_l \neq m_r$, as shown by the corollary below. 

Corollary 2: Assume that $\mathcal{G}_T$ is an $m_l \times m_r$ bipartite graph 
and define $m = \min(m_l, m_r)$. Under the same assumptions 
of Theorem 4, the PGM algorithm with threshold $r \ge 4$ 
successfully identifies w.h.p. $m - o(m)$ good pairs on both the 
LHS and the RHS of $\mathcal{P}(\mathcal{G}_T)$, with no errors. Furthermore, the 
PGM algorithm can be successfully applied to an imperfect 
pairs graph $\hat{\mathcal{P}}(\mathcal{G}_T) \subset \mathcal{P}(\mathcal{G}_T)$ comprising a finite fraction of 
pairs on both the LHS and the RHS of $\mathcal{P}(\mathcal{G}_T)$ and satisfying 
the following constraint: a bad pair $[i_1, j_2] \in \mathcal{P}(\mathcal{G}_T)$ is 
included in $\hat{\mathcal{P}}(\mathcal{G}_T)$ only if either $[i_1, i_2]$ or $[j_1, j_2]$ are also 
in $\hat{\mathcal{P}}(\mathcal{G}_T)$. 

Proof: The assertion can be proved by following the same 
arguments as in Theorem 4 and applying Corollary 1. ■ 

Finally, we prove the following result, which shows that all 
good pairs can be matched with no errors w.h.p. 

Theorem 5: Consider that $\mathcal{G}_T$ is an $m_l \times m_r$ bipartite graph 
with $m_l = \omega(\sqrt{m_r})$ and that a seed set $\mathcal{A}_0^l$ is available on the 
LHS of the graph, with $|\mathcal{A}_0^l| = a_0 = \Theta(m_l)$. With probability 
larger than 1 — e , all the to,, good pairs on the RHS can 
be successfully identified with no errors, provided that: 

1 ■ ^ P mln — Pmax 1 

2. $p_{\min} = \Theta(p_{\max})$ 

3. a matching algorithm is used on $\mathcal{P}(\mathcal{G}_T)$ that matches all 
pairs on the RHS that have at least $r$ adjacent seeds on the 
LHS, with $r = a_0 p_{\min} / 2$. 

The same result holds in the case of an imperfect pairs graph 
comprising a finite fraction of all possible pairs on the RHS. 
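
The threshold rule in condition 3 can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes a constant edge probability `p` (so that $p_{\min} = p_{\max}$), takes every LHS node as a seed, and uses hypothetical sizes `m_l`, `m_r` and a sampling probability `s` chosen only to make the separation between good and bad pairs visible. A good RHS pair collects about $a_0 p s^2$ marks, a bad one about $a_0 p^2 s^2$, so the threshold $r = a_0 p / 2$ sits between the two.

```python
import random

def threshold_match(m_l=2000, m_r=20, p=0.1, s=0.9, seed=0):
    """Match RHS pairs that collect at least r marks from LHS seed pairs."""
    rng = random.Random(seed)
    # Ground-truth bipartite edges: LHS node i -- RHS node j with probability p.
    truth = [[rng.random() < p for _ in range(m_r)] for _ in range(m_l)]

    def sample():
        # Independent edge sampling of the ground truth with probability s.
        nbrs = [set() for _ in range(m_r)]
        for i in range(m_l):
            for j in range(m_r):
                if truth[i][j] and rng.random() < s:
                    nbrs[j].add(i)
        return nbrs

    nbrs1, nbrs2 = sample(), sample()
    a0 = m_l            # assumption: every LHS node is a correctly matched seed
    r = a0 * p / 2      # threshold of condition 3
    matched = set()
    for j1 in range(m_r):
        for j2 in range(m_r):
            # Marks of pair [j1, j2]: seeds adjacent to j1 in G1 and to j2 in G2.
            if len(nbrs1[j1] & nbrs2[j2]) >= r:
                matched.add((j1, j2))
    return matched

matched = threshold_match()
good = {(j, j) for j in range(20)}
print(matched == good)
```

With these illustrative parameters the expected marks are roughly 162 for a good pair and 16 for a bad pair against a threshold of 100, which is the gap the proof exploits through the union bound.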

Proof: Without loss of generality, we assume $a_0 \ge c\, m_l$ 
for some $c > 0$. The proof is obtained by applying the 
inequalities reported in Appendix A. First, observe that, given 
a good pair $[j_1, j_2]$ on the RHS of the pairs graph, its expected 
number of adjacent seeds on the LHS is $E[N_g] \ge a_0 p_{\min} = 2r$. Thus, 
by applying inequality (7) and the union bound, we have: 


P(all good pairs on the RHS have at least r adjacent seeds) 

> 1 - TO r e“ cmiPmi ” ff( 5) > 1 - 


which implies that all good pairs on the RHS are successfully 
matched since $m_l = \omega(\sqrt{m_r})$. Similarly, considering a bad 
pair $[j_1, k_2]$ on the RHS, its expected number of adjacent seeds on 
the LHS is $E[N_b] = O(m_l\, p_{\max}^2) \ll r$. Thus, by applying 
inequality (9) and the union bound, we have: 


P(all bad pairs on the RHS have less than r adjacent seeds) 


\ i -cmi- 

> 1 — m„e 


■ log! 


> 1 - e“ 
B. The de-anonymization procedure 

We now outline how our proposed de-anonymization technique 
works. First, we consider two hyper-cubic regions, 
$\mathcal{H}_l \subset \mathcal{H}$ and $\mathcal{H}_r \subset \mathcal{H}$, whose side is $h(n) = \Omega(C(n))$ and 
whose distance is $g(n) = \omega(C(n))$ (see Fig. 3). Note that, by 
construction, given two vertices $i \in \mathcal{H}_l$ and $j \in \mathcal{H}_r$, 
$p_{\min} = K(n) f(g(n) + \sqrt{k}\, h(n)) \le p_{ij} \le K(n) f(g(n)) = p_{\max}$. Let 
us assume $p_{\max} = \eta\, p_{\min}$ for some constant $\eta > 1$. 

We then extract vertices in $\mathcal{H}_l$ and $\mathcal{H}_r$ from the rest of the 
vertices so that we can focus on the bipartite graph induced 
by the nodes in the two sub-regions, along with the edges 
between them. To this end, we assume that two sufficiently 
large sets of seeds are available in $\mathcal{H}_l$ and $\mathcal{H}_r$ so that Theorem 
2 can be applied. In this regard, observe that we can use the 
same procedure as in Section V to make sure that a bad pair 
$[i_1, j_2]$ is included in the pairs graph only if either $[i_1, i_2]$ or 
$[j_1, j_2]$ is also included in it. We can then apply Corollary 2. 

It follows that the execution of the PGM algorithm ensures 
that almost all of the good pairs in either the LHS or the RHS 
of the pairs graph are correctly de-anonymized. Without loss 
of generality, we assume that almost all pairs on the LHS are 
de-anonymized, i.e., $m_l < m_r$, and that a non-negligible fraction 
of the good pairs on the RHS have still to be identified. Then 
the rest of the good pairs on the RHS can be matched by applying 
Theorem 5. 

To further extend the de-anonymization procedure, we first 
observe that it is possible to estimate in order sense the length 
of the edges between two nodes, again, by exploiting the dense 
structure of the clusters. 

Proposition 1: Given two nodes in region $\mathcal{H}$, it is possible 
to estimate with arbitrary precision their mutual distance $d$ as 
far as $d \ll C(n)\, (n K^2(n) C^k(n))^{1/\beta}$. 

Fig. 5. Computation of $E[N_{ij}]$. 


Proof: Let us consider two nodes $i$ and $j$ on $\mathcal{G}_1$ ($\mathcal{G}_2$) 
whose mutual distance is $d_{ij}$. Let $N_{ij}$ be the variable that 
represents the number of their common neighbors. By 
construction, we have: 

$$E[N_{ij}] = (n-2)\, s^2 K^2(n) \int_{\mathcal{H}} f(\|x - x_i\|)\, f(\|x - x_j\|)\, dx = \Theta\big(n C^k(n) K^2(n) f(d_{ij})\big).$$

Observe that $E[N_{ij}]$ is continuous and strictly decreasing with 
$d_{ij}$, and thus invertible. Now, applying the Chernoff bound, we can 
show that for any $0 < \delta < 1$ 

$$P\left( \frac{|N_{ij} - E[N_{ij}]|}{E[N_{ij}]} > \delta \right) < e^{-c(\delta) E[N_{ij}]}$$

for a proper constant $c(\delta) > 0$. Furthermore, for $\delta > 1$ 

$$P\left( \frac{N_{ij}}{E[N_{ij}]} > \delta \right) < e^{-c(\delta)\, \delta\, E[N_{ij}] \log \delta}.$$

Since $E[N_{ij}] \to \infty$ as long as $d \ll C(n)\, (n K^2(n) C^k(n))^{1/\beta}$, 
the assertion follows. ■ 
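
Because $E[N_{ij}]$ is strictly decreasing in the distance, a common-neighbor count can be turned into a distance estimate by numerical inversion. The sketch below is only illustrative: `expected_common` is a hypothetical stand-in for $E[N_{ij}]$ that uses an assumed smooth profile $f(d) = (1 + d/C)^{-\beta}$ and made-up parameter values, and the inversion is plain bisection.

```python
def expected_common(d, n=1_000_000, C=0.05, K=0.5, s=0.8, beta=3.0, k=2):
    """Hypothetical stand-in for E[N_ij]: n * C^k * s^2 * K^2 * f(d)."""
    f = (1.0 + d / C) ** (-beta)   # assumed strictly decreasing edge profile
    return n * C**k * s**2 * K**2 * f

def estimate_distance(count, lo=0.0, hi=10.0, iters=200):
    """Invert the decreasing map d -> expected_common(d) by bisection."""
    if count >= expected_common(lo):
        return lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expected_common(mid) > count:
            lo = mid   # expected count still too high: the distance is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

d_true = 0.12
d_hat = estimate_distance(expected_common(d_true))
print(d_true, d_hat)
```

In practice the observed count fluctuates around its mean, and the Chernoff bounds above control how far the resulting estimate can drift from the true distance.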

We can therefore use the number of common neighbors 
between two endpoint nodes as an estimator for their distance. 
We then set two thresholds, $d_L = \Theta(C(n) \log(n^{1/k} C(n)))$ 
and $d_H = \lambda d_L$ (with $\lambda > 1$), and we leverage the above 
result to correctly classify the edges departing from previously 
matched nodes into three categories: edges that are shorter 
than $d_L$, edges that are longer than $d_H$, and edges whose length 
lies between $d_L$ and $d_H$. In particular, we are interested 
in the latter, for which the following result holds. 

Proposition 2: Assume $K(n) = \omega((n C^k(n))^{-\gamma})$ $\forall \gamma > 0$. 
Consider a set comprising a finite fraction of the nodes of 
$\mathcal{G}_1$ ($\mathcal{G}_2$) that lie in a region of side $\Theta(C(n))$, and the edges 
incident to them. For an arbitrarily selected $\delta > 0$, w.h.p. (i.e., 
with a probability larger than $1 - [C(n)]^k$) we can select 
all edges whose length $d$ satisfies $(1 + \delta) d_L < d < (1 - \delta) d_H$. 
Furthermore, no edges whose length is $d < (1 - \delta) d_L$ or 
$d > (1 + \delta) d_H$ are selected. 

Proof: The proof follows the same scheme as the proof of 
Theorem 2; here we provide just a sketch. 

Fix $\delta > 0$. First, we consider all edges whose length does 
not exceed $(1 - \delta) d_L$. By applying Proposition 1 and the union 
bound, the probability that at least one of them is selected can be 
bounded by: 

$$P(\text{some edge with length } d < (1 - \delta) d_L \text{ is selected}) \le N_e\, e^{-c'\, n C^k(n) K^2(n) f((1 - \delta) d_L)}$$

where $N_e$ is the number of edges with length $d < (1 - \delta) d_L$ 
and $c'$ is a suitable constant. Now, since by construction 
$N_e = O(n C^k(n) D(n)) = O((n C^k(n))^2 K(n))$ and $d_L = 
\Theta(C(n) \log(n^{1/k} C(n)))$, none of those edges is included. Similarly, 
we can show that all edges whose length is $(1 + \delta) d_L < d < 
(1 - \delta) d_H$ are selected. 

To show that none of the edges whose length exceeds 
$d_H (1 + \delta)$ is selected, we resort to the same ideas as in the 
proof of Theorem 2. In particular, we partition such edges into 
smaller groups containing only edges of similar length. 
For each of the groups so defined, we exploit the Chernoff 
inequality along with the union bound (similarly as before) 
to provide an upper bound to the probability that at least one 
of such edges is selected. We conclude the proof by showing 
that the previous property holds uniformly over all the groups. ■ 
At this point, we consider a bipartite graph whose LHS 
is still represented by $\mathcal{H}_l$, and whose RHS is given by the 
nodes that are connected with those in $\mathcal{H}_l$ through edges of 
length between $d_L$ and $d_H$. We can therefore apply 
Theorem 5 and match w.h.p. all good pairs on the RHS, with 
no errors. The procedure is then iterated so as to successfully 
de-anonymize the whole network graph. Note that, at every 
step, we apply the following proposition to extract a group of 
matched nodes whose mutual distance is $\Theta(C(n))$. 

Proposition 3: Assume $K(n) = \omega((n C^k(n))^{-\gamma})$ $\forall \gamma > 0$. 
Given a node $i$, we can set a threshold $d_T = \Theta(C(n))$ and 
select all nodes in $\mathcal{G}_1$ ($\mathcal{G}_2$) whose estimated distance from $i$ is 
less than $d_T$. So doing, for an arbitrarily selected $\delta > 0$, we 
successfully select with a probability larger than $1 - [C(n)]^k$ 
all nodes whose real distance is $d < (1 - \delta) d_T$. Furthermore, 
no nodes whose distance from $i$ is $d > (1 + \delta) d_T$ are selected 
by our algorithm. 

Proof: The proof follows exactly the same lines as the 
proof of Proposition 2. ■ 


C. Minimum seed set size 

To explicitly derive the minimum size of the seed set, 
we need to further specify $h(n)$ and $g(n)$, which are to be 
carefully selected so as to minimize the resulting critical size 
$a_c$ in Theorem 4 and Corollary 2. 

Starting from the result provided by Theorem 4, $a_c$ can be 
written as: 

$$a_c = \left(1 - \frac{1}{r}\right) \left( \frac{(r-1)!}{m\,(p_{\min} s^2)^r} \right)^{\frac{1}{r-1}} < \frac{r-1}{(m\, p_{\min} s^2)^{\frac{1}{r-1}}\; p_{\min} s^2} \qquad (6)$$

The above expression can be minimized by maximizing $p_{\min}$, 
i.e., by minimizing $g(n)$ (recall that $p_{\min} = K(n) f(g(n) + 
\sqrt{k}\, h(n))$). However, $g(n)$ and $h(n)$ must also be selected in 
such a way that condition 1) of Theorem 4 is met. Additionally, 
as mentioned, it must be ensured that $h(n) = \Omega(C(n))$. At 
last, by standard concentration results, $m_l$ and $m_r$ turn out to 
be both $\Theta(n h^k(n))$ provided that $h(n) > (\log n / n)^{1/k}$. 

The previous considerations lead us to fix $h(n) = \Theta(C(n)) > 
(\log n / n)^{1/k}$ (i.e., the minimum possible value in order 
sense), which corresponds to having $m = \Theta(n C^k(n))$ (recall 
that $m = \min(m_l, m_r)$). We then derive $g(n)$ by forcing 
$p_{\max} = m^{-\alpha/r}$, with $3.5 < \alpha < 4$ and $r \ge 4$. Note that 
condition 1) of Theorem 4 is met since $p_{\max}$ and $p_{\min}$ are 
both $\Theta(m^{-\alpha/r})$. Hence, we have $p_{\max} = \Theta((n C^k(n))^{-\alpha/r})$ and 
$g(n) = \Theta\big(n^{\frac{\alpha}{r\beta}} [C(n)]^{1 + \frac{k\alpha}{r\beta}} [K(n)]^{\frac{1}{\beta}}\big)$. 

Given the above expression for $p_{\max}$, considering that 
$p_{\max} = \eta\, p_{\min}$ and using (6), the minimum seed set size can 
be made as small as 

$$a_c = O([n C^k(n)]^{\epsilon})$$

for any $\epsilon > 0$, by choosing $r$ sufficiently large. 

Finally, we remark that the obtained $a_c$ is in order sense 
greater than the minimum number of seeds needed to apply 
Theorem 2 while selecting nodes in regions $\mathcal{H}_l$ and $\mathcal{H}_r$, thus 
the whole construction is consistent. 
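
As a sanity check on this scaling, the following short derivation (a sketch under the stated choice $p_{\min} = \Theta(p_{\max}) = \Theta(m^{-\alpha/r})$, with $r$ and $s$ treated as constants) makes the dependence on $r$ explicit, starting from the upper bound on $a_c$ that follows from Theorem 4 and $(r-1)! \le (r-1)^{r-1}$:

```latex
\begin{align*}
a_c &< \frac{r-1}{(m\,p_{\min}s^2)^{\frac{1}{r-1}}\;p_{\min}s^2}
     = \Theta\!\left( m^{\frac{\alpha}{r} - \frac{1}{r-1}\left(1-\frac{\alpha}{r}\right)} \right)
     = \Theta\!\left( m^{\frac{\alpha-1}{r-1}} \right), \\
\frac{\alpha-1}{r-1} &\le \epsilon
     \quad\Longleftrightarrow\quad r \ge 1 + \frac{\alpha-1}{\epsilon}.
\end{align*}
```

Since $m = \Theta(n C^k(n))$ and $\alpha < 4$, under these assumptions any $r \ge 1 + 3/\epsilon$ is enough to obtain $a_c = O([n C^k(n)]^{\epsilon})$.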

VII. Experimental validation 

Although our results hold asymptotically as $n \to \infty$, we 
can expect to qualitatively observe the main effects predicted 
by the analysis also in finite-size graphs. We will first 
investigate the performance of graph matching algorithms in 
synthetic graphs generated according to our model of clustered 
networks, and then apply them to real social network graphs. 

A. Synthetic graphs 

In this section we consider bi-dimensional graphs having 
$n = 10,000$, the sampling probability $s = 0.8$ and, unless 
otherwise specified, the average node degree in the ground-truth 
graph $D(n) = 30$. 

Fig. 6 reports the average number of correctly matched 
nodes across 1,000 runs of the PGM algorithm (using $r = 5$) 
in various cases, as a function of the number of seeds. In each 
run, seeds are either chosen uniformly at random among all 
nodes (label ‘uniform seeds’), or as a compact set around one 
randomly chosen seed (label ‘compact seeds’). In our model of 
clustered graphs, we have fixed $\beta = 3$ (the decay exponent of 
the edge probability beyond $C(n)$), and we consider either 
$K(n) = 0.05$ or $K(n) = 0.2$. As a reference, in the plot 
we also show the phase transition occurring (at about 600 
seeds) when $\mathcal{G}_T$ is a $G(n,p)$ graph having the same average 
node degree. The plot confirms the wave-like nature of the 
identification process as predicted by our analysis, namely: 
i) clustered networks (larger $K(n)$) can be matched starting 
from a much smaller seed set as compared to $G(n,p)$; ii) such a 
huge reduction requires seeds to be selected within a small 
sub-region of $\mathcal{H}$. 

Fig. 6. Comparison of PGM performance (with $r = 5$) in different networks 
with $n = 10,000$. Number of good matches (averaged over 1,000 runs) as a 
function of the number of seeds, chosen either uniform or compact. 


What the plot in Fig. 6 does not clearly show (except 
for a rough estimate based on the maximum number of 
correctly matched nodes) is the error ratio incurred by the 
PGM algorithm, which is expected to become larger and 
larger as we increase the level of clustering in the network. 
This phenomenon is confirmed by Fig. 7, which reports the 
average error ratio (bad matches over all matches) incurred 
by PGM as a function of $K(n)$, starting from a compact set 
of seeds. In Fig. 7 we have also considered different values 
of $\beta$. The little circle denotes the operating point already 
considered for the left-most curve in Fig. 6, having an error 
ratio of about 5%. The plot reveals that the error ratio increases 
dramatically when $K(n)$ tends to 1, confirming that PGM 
cannot be safely applied in highly clustered networks. The 
effect of $\beta$ is more intriguing: smaller values of $\beta$ produce fewer errors 
since generated network graphs tend to become more similar 
to $G(n,p)$, where PGM is known to perform very well. As a 
side effect, smaller values of $\beta$ tend to slightly increase the 
percolation threshold (not shown in the plot). For example, 
for $K(n) = 0.4$, the critical numbers of seeds (estimated from 
simulations) corresponding to $\beta = 2.2, 2.5, 3, 4$ are equal to 
11, 15, 24, 45, respectively. 

Fig. 7. Error ratio of PGM as a function of $K(n)$ for different values of $\beta$, 
starting from compact seeds. 

Fig. 8. Average number of good and bad pairs matched by different algorithms 
for $K(n) = 0.8$, $\beta = 3$, starting from compact seeds. 

Next, we focus on the ‘hard’ case corresponding to the little 
square shown in Fig. 7, i.e., $K(n) = 0.8$, $\beta = 3$. This case 
corresponds to networks having highly dense clusters, where 
the performance of the original PGM algorithm is rather poor 
(error ratio about 50%). Fig. 8 shows the average number 
of nodes matched by different algorithms as a function of 
the number of seeds: thick lines correspond to good matches, 
whereas thin lines (with the same line style) refer to bad 
matches produced by a given algorithm. For the sake of 
simplicity, network de-anonymization is performed by applying 
a simplified version of the algorithm proposed and analysed 
in Section VI. This simple algorithm consists in adopting PGM 
after having removed all graph edges shorter than $x \cdot C(n)$. 
In the following, we will call this algorithm ‘filtered PGM’ 
and we will label the corresponding curves in the plots by 
‘f = <x>’. We stress that filtered PGM provides qualitatively 
similar results to the performance of the algorithm in Section 
VI. 
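
The filtering step of filtered PGM can be sketched as follows. This is an illustration only: it assumes that node coordinates are available (as they are when the synthetic graphs are generated), and the names `filter_short_edges`, `pos`, `x` and `C` are hypothetical.

```python
import math

def filter_short_edges(edges, pos, x, C):
    """Keep only the edges whose Euclidean length is at least x * C."""
    return [(u, v) for (u, v) in edges if math.dist(pos[u], pos[v]) >= x * C]

# Hypothetical coordinates of three nodes in the plane.
pos = {0: (0.0, 0.0), 1: (0.03, 0.04), 2: (0.3, 0.4)}
edges = [(0, 1), (0, 2), (1, 2)]
kept = filter_short_edges(edges, pos, x=1.0, C=0.1)
print(kept)
```

PGM is then run on the filtered edge list, so that marks are spread only along edges that leave the local cluster.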

Looking at Fig. 8, it is important to remark that in this 
scenario the performance of the various algorithms is highly 
sensitive to the location of the set of seeds (in each run we 
uniformly select one seed among all nodes, and put all of 
the other seeds around it). Since we average the results over 
1,000 runs, this explains why all curves do not exhibit a sharp 
transition^7. An average number of matched nodes equal to, say, 
2,000, must be given the following probabilistic interpretation: 
about 1/5 of (uniformly chosen) initial locations allow us to 
match almost all nodes (10,000), while 4/5 of initial locations 
do not trigger the percolation effect. 

TABLE II 
Combinations of parameters achieving error ratio 3%, 
percolation probability 50% 

average node degree | f   | # seeds 
36                  | 1.1 | 22 
45                  | 1.2 | 24 
53                  | 1.3 | 28 
64                  | 1.4 | 32 

Also, we note that the poor performance of standard PGM 
cannot be fixed by just increasing the threshold $r$: using $r = 7$, 
PGM still produces about 12% error ratio, while requiring 
many more seeds (only about 2,000 nodes are matched on 
average starting from 100 seeds). Instead, filtered PGM, with 
$f = 1$ and $r = 4$, requires very few seeds to match almost all 
nodes, incurring about 3.7% error ratio. Using $f = 1$, $r = 5$, 
filtered PGM requires more seeds, but achieves an error ratio 
as low as 0.3%. 



Fig. 9. Effect of varying the filtering factor $f$ for fixed $r = 4$ (scenario with 
$K(n) = 0.8$). 

Next, we fix $r$ and increase the filtering factor $f$ so as 
to diminish the number of errors while, however, reducing 
the average number of matched nodes (i.e., the probability to 
trigger percolation from a given seed set). Fig. 9 illustrates this 
effect for $r = 4$, in the case of two different seed set sizes, 
30 and 60. Having 60 seeds one could, for example, employ 
$f = 1.1$, obtaining a very high chance of percolation (almost 
100%) and a small error ratio (around 1%). 

Alternatively, we can fix a desired error ratio and average 
number of matched nodes (i.e., the probability to trigger 
large-scale percolation), and look for the filtering factor and 
number of seeds that let us achieve these goals. Table II 
reports an example of this numerical exploration, in which we 
vary the average degree of the nodes in $\mathcal{G}_T$ corresponding to 
each examined scenario (the average degree can be increased, 
for fixed $K(n) = 0.8$, by increasing $C(n)$). The results in 
Table II validate, at least qualitatively, the counter-intuitive 
theoretical predictions in Table I: as we increase $C(n)$ (and 
thus the average node degree), the seed set size necessary to 
achieve a desired matching performance increases as well. 

7 We verified that, if we instead fix the very first seed across all runs, a sharp 
transition appears. However, the transition threshold changes as we vary the 
initial seed (results not shown here). 

B. Real social graphs 

We consider a real graph derived from the Slovak social 
network Pokec G2). The public data set, available at Gl, 
is a directed graph with 1,632,803 vertices, where nodes 
are users of Pokec and directed edges represent friendships. 
Since the original graph contains too many vertices for our 
computational power, and since we would like to isolate the 
impact of clustering from the effect of long-tailed degree 
distributions, we considered only vertices having: i) in-degree 
larger than 20; ii) out-degree smaller than 200. We ended up 
with a reduced graph having $n = 133,573$ nodes, average 
(in or out) degree 40.8 and clustering coefficient 0.1. We use 
this graph as our ground-truth, and employ an edge sampling 
probability $s = 0.8$. Notice that we maintain the directed nature 
of the edges, since all considered algorithms immediately 
apply to directed networks as well^8. 



Fig. 10. Performance of matching algorithms in a subset of the friendship 
graph of the social network Pokec. 


Fig. 10 shows the performance of the different algorithms 
using threshold $r = 6$. As before, curves labelled ‘uniform’ 
refer to the PGM algorithm in which seeds are selected 
uniformly at random among the nodes. Curves labelled ‘compact’ 
refer to the PGM algorithm in which seeds are chosen among 
the closest neighbors of a uniformly selected node. Curves 
labelled ‘filter 10’ differ from the previous one in that the 
edges connecting each node to its nearest 10 neighbors are 
not used by the algorithm. We emphasize that a $G(n,p)$ 
having the same number of nodes and average degree would 
require $a_c = 5,783$ seeds, according to the critical seed set size 
for $G(n,p)$. In contrast, all 
considered algorithms require far fewer seeds to match 
almost all nodes, confirming that real social networks are 
much simpler to de-anonymize than $G(n,p)$. In particular, 
the uniform variant requires about 300 seeds to match on 
average more than 100,000 nodes, but incurs a quite large error 
ratio (about 17%). The compact variant reduces this number 
roughly by a factor of 3, but produces the same error ratio. At 
last, the filtered variant requires slightly more seeds than the 
compact one, but it lowers the error ratio to 
about 4%. The above results confirm the crucial performance 
improvement that can be obtained by jointly: i) starting from a 
compact set of seeds (to exploit the wave-propagation effect), 
ii) carefully discarding edges connecting nodes to their local 
clusters (to limit the errors). 

8 In directed networks, counters of matchable pairs are incremented only by 
using outgoing edges from matched pairs. 
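
The reference value quoted for $G(n,p)$ can be cross-checked numerically. The sketch below assumes that the critical seed set size has the same form as $a_c$ in Theorem 4 with $m = n$ and $p_{\min} = p$, and that $p$ is the average degree divided by $n$; the function name `critical_seed_size` is ours.

```python
import math

def critical_seed_size(n, p, s, r):
    """a_c = (1 - 1/r) * ((r-1)! / (n * (p*s^2)^r))^(1/(r-1))."""
    q = p * s * s
    return (1 - 1 / r) * (math.factorial(r - 1) / (n * q**r)) ** (1 / (r - 1))

# Parameters of the reduced Pokec graph as reported in the text.
n, avg_degree, s, r = 133_573, 40.8, 0.8, 6
a_c = critical_seed_size(n, avg_degree / n, s, r)
print(round(a_c))
```

Under these assumptions the result should land within a few units of the 5,783 seeds reported above, which is well over an order of magnitude more than the few hundred seeds needed by the algorithms on the real graph.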

VIII. Conclusions 

We focused on the effect of node clustering on social graph 
de-anonymization. We defined a general model for network 
graphs that can represent different levels of node clustering. 
Then we designed de-anonymization algorithms and analysed 
their performance by using bootstrap percolation. Our 
theoretical results highlight that clustering significantly helps to 
reduce the minimum seed set size required for network 
de-anonymization, and that our algorithms can successfully limit 
the error rate of the de-anonymization procedure. Our findings 
were confirmed by numerical experiments on synthetic and 
real social graphs. 


Appendix 

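Lemma 1: Let $H(b) = 1 - b + b \log b$ for $b > 0$. Suppose 
$n \in \mathbb{N}$, $p \in (0,1)$ and $0 \le k \le n$, and let $\mu = np$. If $k \le \mu$ then: 

$$P(\mathrm{Bin}(n,p) \le k) \le \exp\left(-\mu H\left(\frac{k}{\mu}\right)\right) \qquad (7)$$

if $k \ge \mu$ then: 

$$P(\mathrm{Bin}(n,p) \ge k) \le \exp\left(-\mu H\left(\frac{k}{\mu}\right)\right) \qquad (8)$$

if $k \ge e^2 \mu$ then: 

$$P(\mathrm{Bin}(n,p) \ge k) \le \exp\left(-\frac{k}{2} \log \frac{k}{\mu}\right) \qquad (9)$$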
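
The three inequalities of Lemma 1 can be checked against exact binomial tails. The following sketch uses hypothetical values of `n` and `p` and only the Python standard library; it is a numerical illustration, not a proof.

```python
import math

def H(b):
    """H(b) = 1 - b + b*log(b), extended by continuity with H(0) = 1."""
    return 1.0 if b == 0 else 1.0 - b + b * math.log(b)

def binom_pmf(n, p, j):
    return math.comb(n, j) * p**j * (1 - p) ** (n - j)

def lower_tail(n, p, k):
    return sum(binom_pmf(n, p, j) for j in range(0, k + 1))

def upper_tail(n, p, k):
    return sum(binom_pmf(n, p, j) for j in range(k, n + 1))

n, p = 200, 0.1          # illustrative values
mu = n * p
checks = []
for k in range(0, n + 1):
    if k <= mu:          # inequality (7)
        checks.append(lower_tail(n, p, k) <= math.exp(-mu * H(k / mu)))
    if k >= mu:          # inequality (8)
        checks.append(upper_tail(n, p, k) <= math.exp(-mu * H(k / mu)))
    if k >= math.e**2 * mu:   # inequality (9)
        checks.append(upper_tail(n, p, k) <= math.exp(-(k / 2) * math.log(k / mu)))
all_hold = all(checks)
print(all_hold)
```

Inequality (9) is a weaker but simpler consequence of (8): when $\log(k/\mu) \ge 2$, the term $\mu - k + k \log(k/\mu)$ is at least $\frac{k}{2} \log(k/\mu)$.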

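Without loss of generality, let us focus on $\mathcal{G}_1$ and let 
us consider a node $i \in \mathcal{H}_{in}(\alpha, \delta)$. By construction, the 
number of seeds that are neighbors of $i$ on $\mathcal{G}_1$ satisfies 
$S_i \ge_{st} Y_i \ge_{st} Y$, where 

$$Y_i = \mathrm{Bin}\Big(a_0,\; s K(n) f\big(\max_{\sigma \in \mathcal{A}_0} \|x_i - x_\sigma\|\big)\Big)$$

and $Y = \mathrm{Bin}(a_0, s K(n) (1 + \delta) \alpha)$, with $E[Y] = s K(n) (1 + 
\delta) \alpha a_0$. Now, using the inequalities reported in Appendix A, 
we can bound: 

$$P(Y_i \le \alpha s K(n) a_0) \le \exp\left(-E[Y_i]\, H\left(\frac{\alpha s K(n) a_0}{E[Y_i]}\right)\right) \le \exp\left(-(1 + \delta) \alpha s K(n) a_0\, H\left(\frac{1}{1 + \delta}\right)\right) \qquad (10)$$

with $H(b) = 1 - b + b \log b$. 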

If we consider jointly all nodes in $\mathcal{H}_{in}(\alpha, \delta)$ and we denote 
with $N_{in}$ their number, we can bound the probability that every 
node in $\mathcal{H}_{in}(\alpha, \delta)$ is accepted with: 

$$P(\text{all nodes in } \mathcal{H}_{in} \text{ are accepted} \mid N_{in}) \ge 1 - N_{in} \exp\left(-(1 + \delta) \alpha s K(n) a_0\, H\left(\frac{1}{1 + \delta}\right)\right) \qquad (11)$$

which tends to 1 if $\log N_{in} - (1 + \delta) \alpha s H\left(\frac{1}{1 + \delta}\right) K(n) a_0 \to 
-\infty$. This can be enforced by suitably setting $a_0 = \Omega\left(\frac{\log N_{in}}{K(n)}\right)$. 
Since by construction $|\mathcal{H}_{in}| \ge C^k(n) > \frac{\log n}{n}$, we have w.h.p. 
$N_{in} < 2n |\mathcal{H}_{in}|$ by standard concentration results (see Lemma 2). As a 
consequence, w.h.p. 

$$P(\text{all vertices in } \mathcal{H}_{in} \text{ are accepted}) \to 1$$

provided that $a_0$ is suitably chosen, with: 

$$a_0 = \Omega\left(\frac{\log(n C^k(n))}{K(n)}\right).$$

Then we focus on the nodes in $\mathcal{H}_{out}(\alpha, \delta)$ and we show that 
all those nodes are jointly rejected. Conceptually, we repeat 
the same approach as before; however, the argument is made 
slightly more complex by the fact that, to achieve tight bounds 
on the probability that all nodes in $\mathcal{H}_{out}(\alpha, \delta)$ are jointly 
rejected, we need to partition $\mathcal{H}_{out}(\alpha, \delta)$ into smaller 
sub-regions containing nodes that lie at similar distance from 
the seeds. 


Assuming 5 < 


e —1 


, we define = W 1 (c 


e —1 


) c 


Wout(ct, S ) and W° ut (a, S ) = 7f out (a, S)\7il ut . Furthermore, we 
partition 'H* ut into disjoint sub-regions, i.e., Wq u1 = Uh^iP-lut, 
with H 1 * = Wout( a ’f/ e r 1 ) \ W QUt (a, j~. Now, 

given a vertex i in W° ut CHlui), the number of its neigh¬ 
bor seeds Si on Q± can be bounded from above by a 
Bin(ao, sK(n)( 1 — S)a) ^Bin(ao, ct)^. Furthermore, by 
elementary geometrical arguments, it can be shown that: i) 
|W o 0 J = e(C k (n)l ii) IWL 1 ! = B(C fc (n)) and iii) U 1 * = 

Denoted with 7V° ut and the number of nodes in T~L aut 
and P-Im, respectively, by exploiting again the inequalities in 
Appendix lAl w.h.p. we have: 


P (all nodes in are rejected) < 

1 — ./V^exp (1 — S)asK(n)aoH(y — S'j'j —» 1. 

The above expression holds under the assumption that ao = 
f2 ^ 1 ° s ()(, f ) n | w )) j. Indeed, we remark that N® at < 2n\H® ul \ = 
Q(nC k (n)) w.h.p. At last, 

P (all nodes in l~L\ ul are rejected) 

< 1 - N °* exp (“ asK ^ a ° (f3 log h + 2)^ . 

For every h, N^t < 2n\'H 1,h \ = Q{nh k ~ l C k (n))\ also, 
the number of sub-regions of Hl ut is 0(n/C k (n)). Thus, 
w.h.p we have that jointly on all h’s, the number of nodes 
in these sub-regions can be bounded by 2n\'H 1,h \. Under the 
assumption that ao = 12 | Tog(nC^(ra)) ^, - t can eas jjy g^own 

that P (all nodes in T~Ll at are rejected) —► 1. 

The following proof uses some notation that has been 
introduced in [6] and that is omitted here for brevity (the reader 
may also refer to Appendix 0 for a more detailed description 
of the PGM algorithm and associated notation). 

For any two vertices $i \in \mathcal{M}_l$ and $j \in \mathcal{M}_r$, let $X_{ij}$ be 
the Bernoulli random variable that represents the presence of 
an edge $(i,j) \in \mathcal{E}$. By construction, $\mathrm{Ber}(p_{\min}) \le_{st} X_{ij} \le_{st} 
\mathrm{Ber}(p_{\max})$, i.e., two variables $\underline{X}_{ij}$ and $\overline{X}_{ij}$ with distribution, 
respectively, $\mathrm{Ber}(p_{\min})$ and $\mathrm{Ber}(p_{\max})$, can be defined on the 
same probability space as $X_{ij}$ such that $\underline{X}_{ij} \le X_{ij} \le \overline{X}_{ij}$ 
point-wise. 

We consider the corresponding pairs graph $\mathcal{P}(\mathcal{G}_T)$, which is, 
by construction, composed of all the pairs of vertices residing 
in $\mathcal{M}_l$ and $\mathcal{M}_r$ and of the edges connecting pairs of vertices 
in $\mathcal{M}_l$ with pairs of vertices in $\mathcal{M}_r$. We denote by $\mathcal{P}_l$ and 
$\mathcal{P}_r$, respectively, the set of pairs of $\mathcal{P}(\mathcal{G}_T)$ whose vertices lie 
in $\mathcal{M}_l$ and $\mathcal{M}_r$. Observe that, given two good pairs $[i_1, i_2] \in 
\mathcal{P}_l$ and $[j_1, j_2] \in \mathcal{P}_r$, the presence of an edge in $\mathcal{P}(\mathcal{G}_T)$ is 
associated with the random variable: 

$$Y_{[i_1,i_2],[j_1,j_2]} = X_{ij} X_{ij} S^1_{ij} S^2_{ij} = X_{ij} S^1_{ij} S^2_{ij}$$

where $S^1_{ij}$ and $S^2_{ij}$ are mutually independent $\mathrm{Ber}(s)$ r.v.'s, 
which are in turn independent of $X_{ij}$. By construction, 
$p_{\min} s^2 \le E[Y_{[i_1,i_2],[j_1,j_2]}] \le p_{\max} s^2$. Instead, given two 
bad pairs $[i_1, k_2] \in \mathcal{P}_l$ and $[j_1, l_2] \in \mathcal{P}_r$, $Y_{[i_1,k_2],[j_1,l_2]} = 
X_{ij} X_{kl} S^1_{ij} S^2_{kl}$, with $p^2_{\min} s^2 \le E[Y_{[i_1,k_2],[j_1,l_2]}] \le p^2_{\max} s^2$. 
Finally, if we consider one good pair and one bad pair 
(e.g., $[i_1, i_2] \in \mathcal{P}_l$ and $[j_1, k_2] \in \mathcal{P}_r$), $Y_{[i_1,i_2],[j_1,k_2]} = 
X_{ij} X_{ik} S^1_{ij} S^2_{ik}$, with $p^2_{\min} s^2 \le E[Y_{[i_1,i_2],[j_1,k_2]}] \le p^2_{\max} s^2$. 

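Recall that we assume that two seed sets, $\mathcal{A}_0^l \subset \mathcal{P}_l$ and 
$\mathcal{A}_0^r \subset \mathcal{P}_r$ (with $|\mathcal{A}_0^l| = |\mathcal{A}_0^r|$), are available. On $\mathcal{P}(\mathcal{G}_T)$ we 
run the PGM algorithm [6], suitably modified, as follows. 
At every time step $t$, we extract uniformly at random one 
pair $z^l(t) = [z^l_1, z^l_2]_t \in \mathcal{A}^l_{t-1} \setminus \mathcal{Z}^l_{t-1}$ and one pair $z^r(t) = [z^r_1, z^r_2]_t \in 
\mathcal{A}^r_{t-1} \setminus \mathcal{Z}^r_{t-1}$, adding a mark to all the neighbor pairs in $\mathcal{P}_r$ and 
$\mathcal{P}_l$, respectively. In other words, matched pairs in $\mathcal{P}_l$ contribute 
to the marks of pairs in $\mathcal{P}_r$ and vice versa. Thus, for a generic 
node pair $[i_1, j_2] \in \mathcal{P}_r \setminus \mathcal{Z}^r_t$, marks are updated according 
to the iteration: $M_{[i_1,j_2]}(t) = M_{[i_1,j_2]}(t-1) + Y_{[i_1,j_2],\, z^l(t)}$. 
Similarly, for $[i_1, j_2] \in \mathcal{P}_l$ marks are updated according to 
$M_{[i_1,j_2]}(t) = M_{[i_1,j_2]}(t-1) + Y_{[i_1,j_2],\, z^r(t)}$. For the rest, the 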
algorithm proceeds exactly as described in Section QI] 
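
The mark-spreading mechanism can be sketched in a few lines. This is a simplified illustration of percolation graph matching, not the modified algorithm analysed here: it processes one matched pair at a time from a single queue (rather than one pair per side at every step), and the names `pgm`, `adj1`, `adj2` and `seeds` are ours. On a bipartite graph, the neighbors of an LHS pair are RHS pairs, so marks automatically flow from one side to the other.

```python
from collections import deque

def pgm(adj1, adj2, seeds, r):
    """Percolation graph matching sketch: returns a dict node_in_G1 -> node_in_G2."""
    match = dict(seeds)            # pairs matched so far
    used2 = set(match.values())    # G2 nodes that are already matched
    marks = {}                     # marks[(u, v)]: marks collected by pair [u, v]
    queue = deque(match.items())   # matched pairs that have not spread marks yet
    while queue:
        i, j = queue.popleft()
        for u in adj1[i]:
            if u in match:
                continue
            for v in adj2[j]:
                if v in used2:
                    continue
                marks[(u, v)] = marks.get((u, v), 0) + 1
                if marks[(u, v)] >= r:
                    match[u] = v
                    used2.add(v)
                    queue.append((u, v))
                    break          # u is now matched; skip its other candidates
    return match

# Toy bipartite example: LHS nodes 0-2 are seeds, RHS nodes are 10-12.
adj = {0: {10, 12}, 1: {10, 11}, 2: {11, 12},
       10: {0, 1}, 11: {1, 2}, 12: {0, 2}}
result = pgm(adj, adj, seeds={0: 0, 1: 1, 2: 2}, r=2)
print(result)
```

In the toy example each good RHS pair has two common seed neighbors while every bad pair has at most one, so with $r = 2$ only good pairs reach the threshold.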

Now, it is important to observe that marks of pairs on 
the RHS of the graph evolve exactly as the marks of a 
coupled PGM that operates over a pairs graph $\mathcal{P}_R$ defined 
as follows. Denote the generic pair by $[*_1, *_2]$; then $\mathcal{P}_R$ is 
a graph insisting on the set of nodes $\mathcal{M}_r$ and in which the 
presence of edge $(z^r(t), [*_1, *_2])$, for any $[*_1, *_2] \in \mathcal{P}_r \setminus \mathcal{Z}^r_t$, 
is dynamically unveiled at time $t$ by observing variable 
$X_{z^l_1(t) *_1} X_{z^l_2(t) *_2} S^1_{z^l_1(t) *_1} S^2_{z^l_2(t) *_2}$. In other words, the edges 
originated from $z^l(t)$ are replaced by the edges originated from 
$z^r(t)$ and vice versa. 

Furthermore, we make the following observations. 

(i) We assume that the sequence of matched pairs $\{z^R(t)\}_t \in 
\mathcal{P}_R$ exactly corresponds to the sequence of matched pairs 
$\{z^r(t)\}_t \in \mathcal{P}_r$, i.e., $z^r(t) = z^R(t)$ at every $t$. This is made 
possible by the fact that, given $\mathcal{Z}^r_{t-1} = \mathcal{Z}^R_{t-1}$, marks collected 
by every unmatched pair in the two graphs at time $t$ exactly 
correspond. 

(ii) Our construction is consistent since edges between pairs 
are unveiled only once, specifically at the time at which the 
first of the two edge endpoints in $\mathcal{P}_R$ is placed in $\mathcal{Z}^R_t = 
\mathcal{Z}^r_t$. Since then, the edge is replaced with an edge between two 
pairs that are both in $\mathcal{P}_R$, hence it will not be used again. 

Fig. 11. Graphical representation of the PGM evolution over coupled graphs. 

(iii) $\mathcal{P}_R$ is isomorphic to a pairs graph originated by a 
generalized Erdős-Rényi graph $\mathcal{G}_R$, in which the presence 
of every edge $(z^r(t), *)$ can be represented by a Bernoulli 
r.v. and the probability that the edge is added to the graph 
takes values in the range $[p_{\min}, p_{\max}]$ and is independent of 
other edges. Indeed, observe that the presence of an edge 
in $\mathcal{P}_R$ deterministically corresponds to the presence of the 
corresponding edge in $\mathcal{P}(\mathcal{G}_T)$. Furthermore, by construction, 
different edges in $\mathcal{P}_R$ correspond to different edges in $\mathcal{P}(\mathcal{G}_T)$. 

The same observations hold when we consider the evolution 
of the marks of the pairs on the left hand side and a pairs graph 
$\mathcal{P}_L$, which is originated from a coupled generalized 
Erdős-Rényi graph $\mathcal{G}_L$ with the same properties as $\mathcal{G}_R$. 

Now, clearly $G(m, p_{\min}) \le_{st} \mathcal{G}_R \le_{st} G(m, p_{\max})$ and 
$G(m, p_{\min}) \le_{st} \mathcal{G}_L \le_{st} G(m, p_{\max})$, i.e., $\mathcal{G}_R$ ($\mathcal{G}_L$) can be 
obtained by suitably thinning a graph $G(m, p_{\max})$, while 
a graph $G(m, p_{\min})$ can be obtained by suitably thinning 
$\mathcal{G}_R$ ($\mathcal{G}_L$). Then we invoke the corresponding result for $G(m,p)$ 
graphs to conclude our proof 
and show that our algorithm correctly percolates over $\mathcal{G}_R$ and 
$\mathcal{G}_L$ and, thus, over the original bipartite $\mathcal{G}_T$. 

Our matching procedure requires extracting (selecting) nodes 
that lie in a defined region $\mathcal{H}_0$. 

Clearly, to extract nodes lying in a defined region without 
errors, it is necessary to have direct access to the vertices' 
positions. However, our algorithm has access only to graphs 
$\mathcal{G}_1$ and $\mathcal{G}_2$ (i.e., their adjacency matrices), and thus it extracts 
nodes based on “estimated” positions/distances (i.e., according 
to Theorem ?? or Proposition ??). 

Thus, if we extract nodes on the basis of their estimated 
position, we will necessarily incur some error: some nodes 
in $\mathcal{H}_0$ will not be selected while others lying outside $\mathcal{H}_0$ will 
be selected. We denote with $\mathcal{P}(\mathcal{H}_0)$ the set of pairs whose 
nodes lie in $\mathcal{H}_0$ and with $\tilde{\mathcal{P}}(\mathcal{H}_0)$ the set of pairs composed 
of nodes that are extracted. 

We need to devise a smart strategy that extracts nodes while 
guaranteeing that the following three conditions are satisfied: 

1) Only good pairs formed by vertices whose actual location 
is in $\mathcal{H}_0$ (i.e., good pairs in $\mathcal{P}(\mathcal{H}_0)$) are extracted; 

2) A finite fraction (bounded away from 0) of good pairs of 
$\mathcal{P}(\mathcal{H}_0)$ is extracted (i.e., included in $\tilde{\mathcal{P}}(\mathcal{H}_0)$); 

3) The following situation occurs with negligible probability: 
a bad pair $[i_1, j_2]$ is included in $\tilde{\mathcal{P}}(\mathcal{H}_0)$ while none 
of the pairs $[i_1, i_2]$ and $[j_1, j_2]$ is included. 

The third condition ensures that every selected bad pair is in 
conflict with at least one good pair in the set, thus it will 
not be matched by the PGM algorithm when it (eventually) 
reaches the threshold. Below, we show how conditions 1), 2) 
and 3) can be easily guaranteed. For simplicity, we restrict our 
attention to spherical regions, although the same argument can 
be applied to regions of any shape. 

We first introduce this preliminary result. 

Proposition 4: Assume that the positions of nodes (lengths of 
edges) are estimated with a bounded error $\Delta$. Then, given 
a spherical region $\mathcal{H}_0$ whose side is not smaller than $7\Delta$, it 
is possible to extract a set of nodes from $\mathcal{G}_1$ and $\mathcal{G}_2$ (and 
consequently to define $\tilde{\mathcal{P}}(\mathcal{H}_0)$) satisfying conditions 1), 2) and 
3). 

Proof:

We select nodes as follows. We partition region $\mathcal{H}_0$ into three disjoint sub-regions: an inner spherical region of radius $3\Delta$ co-centered with $\mathcal{H}_0$, an intermediate annulus-shaped region with external radius equal to $5\Delta$, and a remaining outer region.

The idea is to extract only those pairs of vertices whose estimated position falls in either the inner or the intermediate region, under the additional condition that only pairs for which at least one vertex falls in the inner region are extracted. This expedient implies that $[i_1, j_2]$ is selected only if the estimated location of $i_1$ ($j_2$) falls in the inner region and the estimated position of $j_2$ ($i_1$) falls in either the inner or the intermediate region. Clearly, the true position of $i_1$ ($j_2$) must necessarily lie in $\mathcal{H}_0$. Furthermore, all nodes whose true position falls in a spherical region of radius $\Delta$ co-centered with $\mathcal{H}_0$ will necessarily be selected; thus conditions 1) and 2) are met w.h.p. as an immediate consequence of Lemma 2. Finally, 3) is necessarily met as a result of the following argument. (i) Observe that, for every node $i$, the distance between the estimated positions of $i_1$ and $i_2$ is by construction smaller than $2\Delta$. (ii) Then let us consider a selected bad pair $[i_1, j_2]$; without loss of generality, we can assume the estimated position of $i_1$ to lie in the inner region. From consideration (i), the estimated position of $i_2$ must necessarily lie either in the inner or the intermediate region. (iii) As a result, the pair $[i_1, i_2]$ is necessarily selected too by our algorithm.
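
The selection rule used in the proof above can be illustrated with a minimal Python sketch. It assumes two-dimensional estimated positions and a region centered at `center`; the function name `select_pairs` and the dictionary-based inputs are hypothetical choices made for this example only, while the radii $3\Delta$ and $5\Delta$ are those of the inner and intermediate regions.

```python
import math


def select_pairs(est_pos_1, est_pos_2, center, delta):
    """Extract candidate pairs (i, j) with the inner/intermediate-region rule.

    est_pos_1, est_pos_2: dicts mapping a node to its estimated (x, y)
    position in the first and second graph (hypothetical input format).
    A pair is kept when both estimated positions fall within radius
    5*delta of the center and at least one falls within radius 3*delta.
    """
    r_inner, r_mid = 3 * delta, 5 * delta

    def dist(p):
        return math.hypot(p[0] - center[0], p[1] - center[1])

    selected = []
    for i, p in est_pos_1.items():
        for j, q in est_pos_2.items():
            dp, dq = dist(p), dist(q)
            both_in_range = dp <= r_mid and dq <= r_mid
            one_inner = dp <= r_inner or dq <= r_inner
            if both_in_range and one_inner:
                selected.append((i, j))
    return selected


# Toy data with delta = 1: the two estimates of each node differ by < 2*delta.
est1 = {"a": (2.5, 0.0), "b": (4.5, 0.0), "c": (6.5, 0.0)}
est2 = {"a": (4.0, 0.0), "b": (4.8, 0.0), "c": (5.5, 0.0)}
pairs = select_pairs(est1, est2, center=(0.0, 0.0), delta=1.0)
good = {(i, j) for (i, j) in pairs if i == j}
bad = [(i, j) for (i, j) in pairs if i != j]
# Condition 3): every selected bad pair conflicts with a selected good pair.
all_bad_in_conflict = all((i, i) in good or (j, j) in good for (i, j) in bad)
```

In this toy instance the bad pair `("a", "b")` is extracted together with the good pair `("a", "a")`, so the conflict required by condition 3) is present.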

Proposition 5: The same approach can be pursued when Theorem 2 is applied to define the initial set of vertex pairs $\mathcal{P}(N)$ so as to satisfy condition 3) (along with 1) and 2)).

Indeed, in such a case the role of the inner region is played by $\mathcal{D}_{in}(\alpha_1\delta)$, the role of the intermediate region is played by $\mathcal{D}_{in}(\alpha_2\delta) \setminus \mathcal{D}_{in}(\alpha_1\delta)$, while the role of the outer region is played by $\mathcal{D}_{out}(\alpha_1\delta) \setminus \mathcal{D}_{in}(\alpha_1\delta)$. Indeed, by construction, if a vertex $i_1$ is accepted by adopting a threshold $\alpha_1$, the corresponding vertex $i_2$ will necessarily be accepted by adopting a threshold $\alpha_2$. ∎

Lemma 2: The number $N_{\mathcal{H}_0}$ of nodes falling in a region $\mathcal{H}_0$ satisfies $\frac{n}{2}|\mathcal{H}_0| < N_{\mathcal{H}_0} < 2n|\mathcal{H}_0|$ w.h.p., as long as $|\mathcal{H}_0| = \omega(1/n)$. In particular, if $|\mathcal{H}_0| > c\frac{\log n}{n}$, then $\frac{n}{2}|\mathcal{H}_0| < N_{\mathcal{H}_0} < 2n|\mathcal{H}_0|$ with a probability $1 - O(n^{-cH(1/2)})$.

Proof: The proof follows immediately by applying (??) and (??) to $N_{\mathcal{H}_0} = \mathrm{Bin}(n, |\mathcal{H}_0|)$ with $\mu = E[N_{\mathcal{H}_0}] = n|\mathcal{H}_0|$.
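
The concentration stated in Lemma 2 can be checked numerically on small instances. The following Python sketch computes exactly the probability that a $\mathrm{Bin}(n, |\mathcal{H}_0|)$ count falls strictly between half and twice its mean; the function names and the chosen values of `n` and `area` are assumptions made for illustration, not quantities taken from the analysis.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of P(Bin(n, p) = k), for 0 < p < 1, via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def prob_count_in_range(n, area):
    """P(n*area/2 < N < 2*n*area) for N ~ Bin(n, area), with 0 < area < 1."""
    mean = n * area
    lo = math.floor(mean / 2) + 1   # smallest integer strictly above mean/2
    hi = math.ceil(2 * mean) - 1    # largest integer strictly below 2*mean
    return sum(math.exp(log_binom_pmf(k, n, area))
               for k in range(lo, min(hi, n) + 1))


# Mean 100: the count is almost surely within the range of the lemma.
p_large_region = prob_count_in_range(10_000, 0.01)
# Mean 2: the region is too small and the concentration is lost.
p_small_region = prob_count_in_range(10_000, 0.0002)
```

The comparison between the two cases reflects the role of the condition on $|\mathcal{H}_0|$: the bound is informative only when the expected number of nodes in the region grows large.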



Fig. 12. Graphical representation of $\mathcal{D}_1(h_1)$ and $\mathcal{D}_2(h_2)$

Proof of Theorem ??: The proof is based on the application of standard concentration results, namely, the Chernoff bound and the inequalities reported in Appendix ??. For the sake of clarity, we restrict ourselves to the case $k = 2$; the extension to a generic $k$ is easy to obtain.

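Consider a correct pair $[i_1, i_2] \in \mathcal{D}(0, \rho_1) \setminus \mathcal{D}(0, \rho)$ whose location in $\mathcal{H}$ is denoted by $x_i$. We compute its number of edges with pairs in $\mathcal{D}(0, \rho)$:

$$N_{[i_1,i_2]} = \sum_{l \in \mathcal{D}(0,\rho)} \mathbb{1}_{[i_1,i_2][l_1,l_2]} \ge \sum_{l \in \mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))} \mathbb{1}_{[i_1,i_2][l_1,l_2]} = \mathrm{Bin}\left(N_{\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))}, K(n)\right),$$

where $N_{\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))}$ denotes the number of nodes in $\mathcal{D}(0, \rho) \cap \mathcal{D}(x_i, C(n))$.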

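As an immediate consequence of Lemma 2, $N_{\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))} > \frac{n}{2}|\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))|$ with a probability $1 - O(n^{-2})$. Then, conditionally on this relation, we have $\mathrm{Bin}\left(N_{\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))}, K(n)\right) > \frac{n}{2}|\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))|\frac{K(n)}{2}$ with a probability $1 - O(n^{-\gamma})$ with $\gamma > 0$, as can be immediately shown by applying (??).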

As a consequence, our algorithm successfully identifies almost all good pairs in $\mathcal{D}(0, \rho_1) \setminus \mathcal{D}(0, \rho)$ (i.e., $N_{\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))} - o(N_{\mathcal{D}(0,\rho) \cap \mathcal{D}(x_i, C(n))})$) with a probability $1 - O(n^{-1})$, again as a consequence of (??) when applied to the number of matched nodes in $\mathcal{D}(0, \rho)$.

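Next, consider a bad pair $[i_1, j_2]$ whose nodes $i$ and $j$ are located respectively in $x_i$ and $x_j$, with $|x_i| = \rho_i$ and $|x_j| = \rho_j$. Let $\mathcal{D}_1(h_1) = \mathcal{D}(x_i, 2^{h_1+1}C(n)) \setminus \mathcal{D}(x_i, 2^{h_1}C(n))$ for $h_1 \ge 1$, with $\mathcal{D}_1(0) = \mathcal{D}(x_i, 2C(n))$, and $\mathcal{D}_2(h_2) = \mathcal{D}(x_j, 2^{h_2+1}C(n)) \setminus \mathcal{D}(x_j, 2^{h_2}C(n))$ for $h_2 \ge 1$, with $\mathcal{D}_2(0) = \mathcal{D}(x_j, 2C(n))$ (see Figure 12).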
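
The dyadic annuli just defined can be made concrete with a short Python sketch that maps a point to the index $h$ of the annulus containing it, and evaluates the factor $K^2(n)2^{-\beta(h_1+h_2)}$ used in the bounds of this proof. The function names are hypothetical, disks are treated as open sets (an assumption about the boundary convention), and positions are taken to be two-dimensional as in the case $k = 2$.

```python
import math


def annulus_index(x, center, c_n):
    """Index h of the dyadic annulus around `center` that contains x.

    h = 0 corresponds to the disk of radius 2*c_n; for h >= 1 the annulus
    collects points at distance d with 2**h * c_n <= d < 2**(h+1) * c_n.
    """
    d = math.hypot(x[0] - center[0], x[1] - center[1])
    if d < 2 * c_n:
        return 0
    return math.floor(math.log2(d / c_n))


def edge_prob_bound(h1, h2, k_n, beta):
    """Factor K(n)^2 * 2^(-beta*(h1+h2)) used for a pair of annuli."""
    return k_n ** 2 * 2.0 ** (-beta * (h1 + h2))
```

For instance, with `c_n = 1.0` a point at distance 3 from the center is assigned index 1, and a point at distance 4 is assigned index 2.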

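Let $\mathcal{C}(h_1, h_2) = \mathcal{D}(0, \rho) \cap \mathcal{D}_1(h_1) \cap \mathcal{D}_2(h_2)$ for $h_1 \ge 0$ and $h_2 \ge 0$. We have:

$$N_{[i_1,j_2]} = \sum_{l \in \mathcal{D}(0,\rho)} \mathbb{1}_{[i_1,j_2][l_1,l_2]} = \sum_{h_1} \sum_{h_2} \sum_{l \in \mathcal{C}(h_1,h_2)} \mathbb{1}_{[i_1,j_2][l_1,l_2]} \le \sum_{h_1} \sum_{h_2} \mathrm{Bin}\left(N_{\mathcal{C}(h_1,h_2)}, K^2(n)2^{-\beta(h_1+h_2)}\right).$$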


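Now, $\mathcal{C}(h_1, h_2)$ is a subset of both $\mathcal{C}(h_1) = \mathcal{D}(0, \rho) \cap \mathcal{D}_1(h_1)$ and $\mathcal{C}(h_2) = \mathcal{D}(0, \rho) \cap \mathcal{D}_2(h_2)$. Thus $N_{\mathcal{C}(h_1,h_2)} \le \min(N_{\mathcal{C}(h_1)}, N_{\mathcal{C}(h_2)})$. In addition, by construction, $\mathcal{D}(0, \rho) \cap \mathcal{D}_1(h_1) = \emptyset$ if $h_1 < h_1^{\min} = \lceil \log_2(1 + \ldots) \rceil$ or $h_1 > h_1^{\max} = \lceil \log_2(1 + \ldots) \rceil$. Similarly, $\mathcal{D}(0, \rho) \cap \mathcal{D}_2(h_2) = \emptyset$ if $h_2 < h_2^{\min} = \lceil \log_2(1 + \ldots) \rceil$ or $h_2 > h_2^{\max} = \lceil \log_2(1 + \ldots) \rceil$.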

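Hence,

$$N_{[i_1,j_2]} \le \sum_{h_1=h_1^{\min}}^{h_1^{\max}} \sum_{h_2=h_2^{\min}}^{h_2^{\max}} \mathrm{Bin}\left(\min(N_{\mathcal{C}(h_1)}, N_{\mathcal{C}(h_2)}), K^2(n)2^{-\beta(h_1+h_2)}\right). \quad (12)$$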

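Now $\mathcal{C}(h_1)$ is by construction a subset of $\mathcal{D}(0, \rho)$ as well as of $\mathcal{D}_1(h_1)$, thus $N_{\mathcal{C}(h_1)} \le \min(N_{\mathcal{D}(0,\rho)}, N_{\mathcal{D}_1(h_1)})$ and similarly $N_{\mathcal{C}(h_2)} \le \min(N_{\mathcal{D}(0,\rho)}, N_{\mathcal{D}_2(h_2)})$, thus:

$$N_{[i_1,j_2]} \le \sum_{h_1=h_1^{\min}}^{h_1^{\max}} \sum_{h_2=h_2^{\min}}^{h_2^{\max}} \mathrm{Bin}\left(\min(N_{\mathcal{D}(0,\rho)}, N_{\mathcal{D}_1(h_1)}, N_{\mathcal{D}_2(h_2)}), K^2(n)2^{-\beta(h_1+h_2)}\right). \quad (13)$$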


Note that $|\mathcal{D}(0, \rho)| = \pi\rho^2$ while $|\mathcal{D}_1(h_1)| \le \pi 2^{2(h_1+1)}C^2(n)$ and, similarly, $|\mathcal{D}_2(h_2)| \le \pi 2^{2(h_2+1)}C^2(n)$. As a consequence, since all these regions are larger than $C^2(n)$, from Lemma 2 we have that, uniformly in $h_1$ and $h_2$, the number of nodes in these regions is not larger than $2n$ times the volume of the regions themselves, i.e., $N_{\mathcal{D}(0,\rho)} \le 2n\pi\rho^2$, $N_{\mathcal{D}_1(h_1)} \le 2n\pi 2^{2(h_1+1)}C^2(n)$ and $N_{\mathcal{D}_2(h_2)} \le 2n\pi 2^{2(h_2+1)}C^2(n)$ with a probability $1 - O(n^{-2})$. Thus, by construction:


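$$E[N_{[i_1,j_2]}] \le \sum_{h_1=h_1^{\min}}^{h_1^{\max}} \sum_{h_2=h_2^{\min}}^{h_2^{\max}} E\left[\mathrm{Bin}\left(2nC^2(n)\min\left(2^{2(h_1+1)}, 2^{2(h_2+1)}, \pi\frac{\rho^2}{C^2(n)}\right), K^2(n)2^{-\beta(h_1+h_2)}\right)\right](1 - O(n^{-2})) + nO(n^{-2}). \quad (14)$$

Furthermore, $\min(2^{2(h_1+1)}, 2^{2(h_2+1)}) \le 2^{2(h_1+h_2)+3}$. Then we can rewrite the previous expression as:

$$E[N_{[i_1,j_2]}] \le \sum_{h_1=h_1^{\min}}^{h_1^{\max}} \sum_{h_2=h_2^{\min}}^{h_2^{\max}} E\left[\mathrm{Bin}\left(2nC^2(n)\min\left(2^{2(h_1+h_2)+3}, \pi\frac{\rho^2}{C^2(n)}\right), K^2(n)2^{-\beta(h_1+h_2)}\right)\right] + O(n^{-1}). \quad (15)$$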
Now, if $2^{2(h_1^{\min}+h_2^{\min})+3} < \pi\frac{\rho^2}{C^2(n)}$, we can bound:

$$E[N_{[i_1,j_2]}] \le \sum_{h_1=h_1^{\min}}^{h_1^{\max}} \sum_{h_2=h_2^{\min}}^{h_2^{\max}} E\left[\mathrm{Bin}\left(nC^2(n)2^{2(h_1+h_2)+4}, K^2(n)2^{-\beta(h_1+h_2)}\right)\right] + O(n^{-1})$$

$$= nC^2(n) \sum_{h_1=h_1^{\min}}^{h_1^{\max}} \sum_{h_2=h_2^{\min}}^{h_2^{\max}} 2^{4+h_1(2-\beta)+h_2(2-\beta)}K^2(n) + O(n^{-1})$$

$$\le nC^2(n)2^{4+(h_1^{\min}+h_2^{\min})(2-\beta)}K^2(n) \sum_{h_1=0}^{\infty} \sum_{h_2=0}^{\infty} 2^{(h_1+h_2)(2-\beta)} + O(n^{-1})$$

$$= \frac{2nC^2(n)2^{3+(h_1^{\min}+h_2^{\min})(2-\beta)}K^2(n)}{(1 - 2^{2-\beta})^2} + O(n^{-1}). \quad (16)$$

If, instead, $2^{2(h_1^{\min}+h_2^{\min})+3} \ge \pi\frac{\rho^2}{C^2(n)}$, with similar arguments we can bound:

$$E[N_{[i_1,j_2]}] \le 2n\pi\rho^2K^2(n)2^{-(h_1^{\min}+h_2^{\min})\beta}. \quad (17)$$

Observe that, in general, $E[N_{[i_1,j_2]}] = O(n[C(n)]^2K^2(n))$, with $E[N_{[i_1,j_2]}] = \Theta(nC^2(n)K^2(n))$ only when $h_1^{\min} + h_2^{\min}$ is bounded. As a consequence, the bad pair $[i_1, j_2]$ will not reach threshold $r = \Theta(nC^2(n)K(n))$ w.h.p., as can be immediately verified by applying the Markov inequality. However, we need to show that jointly all bad pairs will remain below the threshold with a probability $1 - O(n^{-1})$. We can prove this stronger property first by deriving a tighter bound for the probability that a specific pair reaches the threshold, and then by applying the union bound on all pairs.

Considering again the bad pair $[i_1, j_2]$, $N_{[i_1,j_2]}$ can be rewritten as $N_{[i_1,j_2]} = \sum_{l \in \mathcal{D}(0,\rho)} \mathbb{1}_{[i_1,j_2][l_1,l_2]}$, i.e., as a sum of independent Bernoulli random variables. Thus, we can apply the Chernoff inequality to bound its tail. Recalling that by construction $r \gg E[N_{[i_1,j_2]}]$, we have:

$$P(N_{[i_1,j_2]} > r) \le \exp\left\{ r\left(1 - \log\frac{r}{E[N_{[i_1,j_2]}]}\right) \right\}. \quad (18)$$

From the definition of $r$, it follows that $r = cnC^2(n)K(n)$ with $c > 0$. Thus,

$$\log P(N_{[i_1,j_2]} > r) \le cnC^2(n)K(n)\left[1 - \log\frac{1}{K(n)} - (\beta - 2)(h_1^{\min} + h_2^{\min})\log 2 + C_1\right], \quad (19)$$

where $C_1$ is a suitable constant. By assumption, $nC^2(n)K(n) > \log n$ and $K(n) = o((\log n)^{-\gamma})$, hence

$$\log P(N_{[i_1,j_2]} > r) \le -c\log n \cdot \omega(1).$$

Since for large $n$ we have $c\log n \cdot \omega(1) > 3\log n$, it turns out that every bad pair $[i_1, j_2]$, regardless of the position of its vertices, reaches threshold $r$ with a probability $O(n^{-3})$. By applying the union bound, we can claim that jointly all such pairs will remain below the threshold $r$ with a probability $1 - O(n^{-1})$.

Algorithm 1 The PGM algorithm

$\mathcal{A}_0 = \mathcal{B}_0 = \mathcal{A}_0(n)$, $\mathcal{Z}_0 = \emptyset$
while $\mathcal{A}_t \setminus \mathcal{Z}_t \ne \emptyset$ do
    $t = t + 1$
    Randomly select a pair $[i_1, i_2] \in \mathcal{A}_{t-1} \setminus \mathcal{Z}_{t-1}$ and add one mark to all neighbor pairs of $[i_1, i_2]$ in $\mathcal{P}(\mathcal{G}_T)$
    Let $\Delta\mathcal{B}_t$ be the set of all neighbor pairs of $[i_1, i_2]$ in $\mathcal{P}(\mathcal{G}_T)$ whose mark counter has reached threshold $r$ at time $t$.
    Construct set $\Delta\mathcal{A}_t \subseteq \Delta\mathcal{B}_t$ as follows. Order the pairs in $\Delta\mathcal{B}_t$ in an arbitrary way, select them sequentially and test them for inclusion in $\Delta\mathcal{A}_t$
    if the selected pair in $\Delta\mathcal{B}_t$ has no conflicting pair in $\mathcal{A}_{t-1}$ or $\Delta\mathcal{A}_t$ then
        Insert the pair in $\Delta\mathcal{A}_t$
    else
        Discard it
    $\mathcal{Z}_t = \mathcal{Z}_{t-1} \cup [i_1, i_2]$, $\mathcal{B}_t = \mathcal{B}_{t-1} \cup \Delta\mathcal{B}_t$, $\mathcal{A}_t = \mathcal{A}_{t-1} \cup \Delta\mathcal{A}_t$
return $T = t$, $\mathcal{Z}_T = \mathcal{A}_T$

The proof we propose complements the one provided in [8], which holds only under the assumption $p_{\min} \gg \ldots$. Here, we restrict to the case $p_{\min} = O(\ldots)$.

With reference to the PGM algorithm reported in Algorithm 1, we define:

• $\mathcal{B}_t(\mathcal{G}_T)$ as the set of pairs in $\mathcal{P}(\mathcal{G}_T)$ that at time step $t$ have already collected at least $r$ marks. It is composed of good pairs $\mathcal{B}'_t(\mathcal{G}_T)$ and bad pairs $\mathcal{B}''_t(\mathcal{G}_T)$;

• $\mathcal{A}_t(\mathcal{G}_T)$ as the set of matchable pairs at time $t$. Similarly to $\mathcal{B}_t(\mathcal{G}_T)$, it comprises good pairs $\mathcal{A}'_t(\mathcal{G}_T)$ and bad pairs $\mathcal{A}''_t(\mathcal{G}_T)$. In general, $\mathcal{A}_t(\mathcal{G}_T)$ and $\mathcal{B}_t(\mathcal{G}_T)$ do not coincide, as $\mathcal{B}_t(\mathcal{G}_T)$ may include conflicting pairs that are not present in $\mathcal{A}_t(\mathcal{G}_T)$;

• $\mathcal{Z}_t(\mathcal{G}_T)$ as the set of pairs that have been matched up to time $t$. By construction, $|\mathcal{Z}_t| = t$, $\forall t$.

Next, we define $T_{G_{p_{\min}}} = \min\{t \text{ s.t. } |\mathcal{A}_t(G(n, p_{\min}))| = t\}$ and $T_{G_{p_{\max}}} = \min\{t \text{ s.t. } |\mathcal{A}_t(G(n, p_{\max}))| = t\}$. By Theorem ??, we have that both $T_{G_{p_{\min}}}$ and $T_{G_{p_{\max}}}$ are equal to $n - o(n)$. Then inductively on $t$, $\forall t < \min(T_{G_{p_{\min}}}, T_{G_{p_{\max}}})$, w.h.p.:

$$|\mathcal{B}''_t(\mathcal{G}_T)| \le |\mathcal{B}''_t(G(n, p_{\max}))| = 0. \quad (20)$$

In (20), the inequality follows from the monotonicity of sets $\mathcal{B}''_t$ with respect to "$\le_{st}$". The following equality follows from Corollary 1 in [8] applied to $G_{p_{\max}}$. We remark that, under our assumption on $p_{\min}$ and $p_{\max}$, we have $t_0 = T$ in Corollary 1 in [8], along with:

$$|\mathcal{A}_t(\mathcal{G}_T)| \overset{(a)}{=} |\mathcal{B}'_t(\mathcal{G}_T)| \overset{(b)}{\ge} |\mathcal{B}'_t(G(n, p_{\min}))| \overset{(c)}{=} |\mathcal{A}_t(G(n, p_{\min}))| \overset{(d)}{>} t. \quad (21)$$
In (21), equality (a) is an immediate consequence of (20), inequality (b) holds by monotonicity of sets $\mathcal{B}'_t$ with respect to "$\le_{st}$", while equality (c) follows from Theorem ??. Inequality (d) follows from the fact that we assume $t < T_{G_{p_{\min}}}$.

Thus, necessarily, $|\mathcal{A}_T(\mathcal{G}_T)| = T \ge \min(T_{G_{p_{\min}}}, T_{G_{p_{\max}}}) = n - o(n)$ and $|\mathcal{B}''_T(\mathcal{G}_T)| = 0$.
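
The PGM procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, seed pairs are assumed to be mutually non-conflicting, and the pairs graph is assumed to connect $[i_1, i_2]$ to $[l_1, l_2]$ whenever $(i_1, l_1)$ is an edge of the first graph and $(i_2, l_2)$ is an edge of the second one. Two pairs are treated as conflicting when they share exactly one vertex.

```python
import random
from collections import defaultdict
from itertools import product


def pairs_graph_neighbors(adj1, adj2):
    """Return a function listing the neighbor pairs of a pair (i1, i2).

    Assumption: (l1, l2) is a neighbor of (i1, i2) when l1 is adjacent to
    i1 in the first graph and l2 is adjacent to i2 in the second graph.
    """
    def neighbors(pair):
        i1, i2 = pair
        return product(sorted(adj1[i1]), sorted(adj2[i2]))
    return neighbors


def pgm(seeds, neighbors, r, rng=None):
    """Percolation graph matching sketch; returns pairs in matching order."""
    rng = rng or random.Random(0)
    seeds = list(seeds)
    used1 = {p[0] for p in seeds}   # first-graph vertices of matchable pairs
    used2 = {p[1] for p in seeds}   # second-graph vertices of matchable pairs
    pending = list(seeds)           # matchable pairs not yet used (A \ Z)
    matched = []                    # pairs already used to spread marks (Z)
    marks = defaultdict(int)
    while pending:
        pair = pending.pop(rng.randrange(len(pending)))
        matched.append(pair)
        for q in neighbors(pair):
            marks[q] += 1
            if marks[q] != r:
                continue            # q did not reach the threshold at this step
            if q[0] in used1 or q[1] in used2:
                continue            # q conflicts with a matchable pair: discard
            used1.add(q[0])
            used2.add(q[1])
            pending.append(q)
    return matched


# Toy example: two copies of the complete graph on four nodes, two seeds.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
nb = pairs_graph_neighbors(adj, adj)
result = pgm([(0, 0), (1, 1)], nb, r=2)
```

In this sketch each pair is tested for conflicts at the moment its counter reaches $r$, which corresponds to one particular (arbitrary) ordering of the pairs in $\Delta\mathcal{B}_t$; the outcome on highly symmetric toy graphs therefore depends on the iteration order of the neighbor pairs.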